## RISC-V XBitmanip Extension

Document Version 0.34

Editor: Clifford Wolf, Symbiotic GmbH, clifford@symbioticeda.com

April 20, 2018

Contributors to all versions of the spec in alphabetical order (please contact editors to suggest corrections): Steven Braeger, Rex McCrary, Po-wei Huang, and Clifford Wolf.

This document is released under a Creative Commons Attribution 4.0 International License.

## Contents

- 1 Introduction
  - 1.1 ISA Extension Proposal Design Criteria
  - 1.2 B Extension Adoption Strategy
  - 1.3 Next steps
- 2 RISC-V XBitmanip Extension
  - 2.1 Count Leading/Trailing Zeros (clz, ctz)
  - 2.2 Count Bits Set (pcnt)
  - 2.3 Generalized Reverse (grev, grevi)
  - 2.4 Shift Ones (Left/Right) (slo, sloi, sro, sroi)
  - 2.5 Rotate (Left/Right) (rol, ror, rori)
  - 2.6 And-with-complement (andc)
  - 2.7 Bit Extract/Deposit (bext, bdep)
  - 2.8 Generalized Bit Permutations (shuffle)
  - 2.9 Compressed instructions (c.not, c.neg, c.brev)
  - 2.10 Pseudo instructions and macros
    - 2.10.1 MIX/MUX Macros
    - 2.10.2 Bit-field extract and deposit
    - 2.10.3 Pseudo instructions using grevi
    - 2.10.4 Pseudo instructions using shuffle
- 3 Discussion
  - 3.1 Frequently Asked Questions
  - 3.2 Analysis of used encoding space
- 4 Evaluation, Algorithms
  - 4.1 Emulating x86 Bit Manipulation ISAs
  - 4.2 Emulating RI5CY Bit Manipulation ISA
  - 4.3 Decoding RISC-V Immediates
- 5 Change History

## Chapter 1

## Introduction

This is the RISC-V XBitmanip Extension draft spec. Originally it was the B-Extension draft spec, but the work group was dissolved for bureaucratic reasons in November 2017.

It is currently an independently maintained document. We'd happily donate it to the RISC-V foundation as a starting point for a new B-Extension work group, if one is formed.

## 1.1 ISA Extension Proposal Design Criteria

Any proposed changes to the ISA should be evaluated according to the following criteria.

- Architecture Consistency: Decisions must be consistent with RISC-V philosophy. ISA changes should deviate as little as possible from existing RISC-V standards (such as instruction encodings), and should not re-implement features that are already found in the base specification or other extensions.
- Threshold Metric: The proposal should provide a *significant* savings in terms of clocks or instructions. As a heuristic, any proposal should replace at least four instructions. An instruction that only replaces three may be considered, but only if the frequency of use is very high.
- Data-Driven Value: Usage in real world applications, and corresponding benchmarks showing a performance increase, will contribute to the score of a proposal. A proposal will not be accepted on the merits of its *theoretical* value alone, unless it is used in the real world.
- Hardware Simplicity: Though instructions saved is the primary benefit, proposals that dramatically increase the hardware complexity and area, or are difficult to implement, should be penalized and given extra scrutiny. The final proposals should only be made if a test implementation can be produced.
- Compiler Support: ISA changes that can be natively detected by the compiler, or are already used as intrinsics, will score higher than instructions that do not fit that criterion.

### 1.2 B Extension Adoption Strategy

The overall goal of this extension is pervasive adoption by minimizing potential barriers and ensuring the instructions can be mapped to the largest number of ops, either direct or pseudo, that are supported by the most popular processors and compilers. By adding generic instructions and taking advantage of the RISC-V base instructions that already operate on bits, only a minimal set of instructions needs to be added while at the same time enabling a rich set of operations.

The instructions cover the four major categories of bit manipulation: Count, Extract, Insert, Swap. The spec supports RV32, RV64, and RV128. "Clever", obscure, and/or overly specific instructions are avoided in favor of more straightforward, fast, generic ones. Coordination with other emerging RISC-V ISA extension groups is required to ensure our instruction sets are architecturally consistent.

### 1.3 Next steps

- Add support for this extension to processor cores and compilers so we can run quantitative evaluations on the instructions.
- Create assembler snippets for common operations that do not map 1:1 to any instruction in this spec, but can be implemented easily using clever combinations of the instructions. Add support for those snippets to compilers.

## Chapter 2

## RISC-V XBitmanip Extension

In the proposals provided in this section, the C code examples are for illustration purposes. They are not optimal implementations, but are intended to specify the desired functionality.

The sections on encodings are mere placeholders.

## 2.1 Count Leading/Trailing Zeros (clz, ctz)

The clz operation counts the number of 0 bits before the first 1 bit (counting from the most significant bit) in the source register. This is related to the "integer logarithm". It takes a single register as input and operates on the entire register. If the input is 0, the output is XLEN. If the input is ~0, the output is 0.

The ctz operation counts the number of 0 bits after the last 1 bit (again counting from the most significant bit), i.e. the number of trailing zero bits in the source register. If the input is 0, the output is XLEN. If the input is ~0, the output is 0.

```
uint_xlen_t clz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 << count) >> (XLEN-1))
            return count;
    return XLEN;
}

uint_xlen_t ctz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 >> count) & 1)
            return count;
    return XLEN;
}
```
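As a minimal sketch, the reference code above can be instantiated for RV32 by fixing `XLEN` to 32 and mapping `uint_xlen_t` to `uint32_t`. Both definitions are assumptions made for illustration only, as is the helper `ilog2_floor`, which shows the relation between clz and the integer logarithm mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: an RV32 target. */
#define XLEN 32
typedef uint32_t uint_xlen_t;

/* Reference clz, unchanged from the spec text. */
uint_xlen_t clz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 << count) >> (XLEN-1))
            return count;
    return XLEN;
}

/* Reference ctz, unchanged from the spec text. */
uint_xlen_t ctz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 >> count) & 1)
            return count;
    return XLEN;
}

/* Hypothetical helper: floor(log2(x)) for x != 0, derived from clz. */
int ilog2_floor(uint_xlen_t x)
{
    assert(x != 0);
    return XLEN - 1 - (int)clz(x);
}
```

With these definitions, `clz(0)` and `ctz(0)` both return `XLEN`, matching the corner cases stated in the text.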

| 31:20        | 19:15 | 14:12  | 11:7 | 6:0       |
|--------------|-------|--------|------|-----------|
| imm[11:0]    | rs1   | funct3 | rd   | opcode    |
| 12           | 5     | 3      | 5    | 7         |
| ???????????? | src   | CLZ    | dest | OP-IMM    |
| ???????????? | src   | CTZ    | dest | OP-IMM    |
| ???????????? | src   | CLZW   | dest | OP-IMM-32 |
| ???????????? | src   | CTZW   | dest | OP-IMM-32 |
One possible encoding for clz and ctz is as standard I-type opcodes somewhere in the brownfield surrounding the shift-immediate instructions.

### 2.2 Count Bits Set (pcnt)

The purpose of this instruction is to compute the number of 1 bits (set bits) in a register. It takes a single register as input and operates on the entire register.

```
uint_xlen_t pcnt(uint_xlen_t rs1)
{
   int count = 0;
   for (int index = 0; index < XLEN; index++)
        count += (rs1 >> index) & 1;
   return count;
}
```

| 31:20        | 19:15 | 14:12  | 11:7 | 6:0       |
|--------------|-------|--------|------|-----------|
| imm[11:0]    | rs1   | funct3 | rd   | opcode    |
| 12           | 5     | 3      | 5    | 7         |
| ???????????? | src   | PCNT   | dest | OP-IMM    |
| ???????????? | src   | PCNTW  | dest | OP-IMM-32 |

One possible encoding for pcnt is as a standard I-type opcode somewhere in the brownfield surrounding the shift-immediate instructions.

## 2.3 Generalized Reverse (grev, grevi)

The purpose of this instruction is to provide a single hardware instruction that can implement byte-order swap, bitwise reversal, short-order swap, word-order swap (RV64I), nibble-order swap, bitwise reversal within a byte, etc. It takes a single register value and an immediate that controls which function occurs by selecting the levels in the recursive tree at which reversals occur.

This operation iterates over the bits of the 'function\_select' immediate, from i = 0 to log2(XLEN)-1, in log2(XLEN) stages. If bit i of the immediate is set, stage i swaps each pair of adjacent 2^i-bit blocks in the register.

grevi 'butterfly' implementation in C on various architectures

```
uint32_t grev32(uint32_t rs1, int32_t rs2)
{
   uint32_t x = rs1;
   if (rs2 &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
   if (rs2 &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
   if (rs2 &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
   if (rs2 &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
   if (rs2 & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
   return x;
}

uint64_t grev64(uint64_t rs1, int32_t rs2)
{
   uint64_t x = rs1;
   if (rs2 &  1) x = ((x & 0x5555555555555555) <<  1) | ((x & 0xAAAAAAAAAAAAAAAA) >>  1);
   if (rs2 &  2) x = ((x & 0x3333333333333333) <<  2) | ((x & 0xCCCCCCCCCCCCCCCC) >>  2);
   if (rs2 &  4) x = ((x & 0x0F0F0F0F0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0F0F0F0F0) >>  4);
   if (rs2 &  8) x = ((x & 0x00FF00FF00FF00FF) <<  8) | ((x & 0xFF00FF00FF00FF00) >>  8);
   if (rs2 & 16) x = ((x & 0x0000FFFF0000FFFF) << 16) | ((x & 0xFFFF0000FFFF0000) >> 16);
   if (rs2 & 32) x = ((x & 0x00000000FFFFFFFF) << 32) | ((x & 0xFFFFFFFF00000000) >> 32);
   return x;
}
```

The pattern above extends in the obvious manner to RV128 and beyond.
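To illustrate how the mode value selects a function, the following sketch wraps the 32-bit butterfly in a few convenience functions. The wrapper names (`bswap32`, `brev32`, `brev8_32`) are hypothetical and chosen for this example only; they are not instruction names defined by this section.

```c
#include <stdint.h>

/* 32-bit generalized reverse: stage i swaps adjacent 2^i-bit blocks. */
uint32_t grev32(uint32_t rs1, int rs2)
{
    uint32_t x = rs1;
    if (rs2 &  1) x = ((x & 0x55555555) <<  1) | ((x & 0xAAAAAAAA) >>  1);
    if (rs2 &  2) x = ((x & 0x33333333) <<  2) | ((x & 0xCCCCCCCC) >>  2);
    if (rs2 &  4) x = ((x & 0x0F0F0F0F) <<  4) | ((x & 0xF0F0F0F0) >>  4);
    if (rs2 &  8) x = ((x & 0x00FF00FF) <<  8) | ((x & 0xFF00FF00) >>  8);
    if (rs2 & 16) x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16);
    return x;
}

/* Hypothetical wrappers: each one is just grev32 with a fixed mode. */

/* Byte-order swap: stages 8 and 16 enabled (mode 24). */
uint32_t bswap32(uint32_t x) { return grev32(x, 24); }

/* Full bit reversal: all five stages enabled (mode 31). */
uint32_t brev32(uint32_t x) { return grev32(x, 31); }

/* Bit reversal within each byte: stages 1, 2 and 4 enabled (mode 7). */
uint32_t brev8_32(uint32_t x) { return grev32(x, 7); }
```

Because every stage is its own inverse and the stages commute, applying grev twice with the same mode returns the original value.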

| 31:25   | 24:20 | 19:15 | 14:12  | 11:7 | 6:0    |
|---------|-------|-------|--------|------|--------|
| funct7  | rs2   | rs1   | funct3 | rd   | opcode |
| 7       | 5     | 5     | 3      | 5    | 7      |
| ??????? | src2  | src1  | GREV   | dest | OP     |
| ??????? | src2  | src1  | GREVW  | dest | OP-32  |

| 31:27     | 26:20    | 19:15 | 14:12  | 11:7 | 6:0    |
|-----------|----------|-------|--------|------|--------|
| imm[11:7] | imm[6:0] | rs1   | funct3 | rd   | opcode |
| 5         | 7        | 5     | 3      | 5    | 7      |
| ?????     | mode     | src   | GREVI  | dest | OP-IMM |

| 31:25     | 24:20    | 19:15 | 14:12  | 11:7 | 6:0       |
|-----------|----------|-------|--------|------|-----------|
| imm[11:5] | imm[4:0] | rs1   | funct3 | rd   | opcode    |
| 7         | 5        | 5     | 3      | 5    | 7         |
| ???????   | mode     | src   | GREVIW | dest | OP-IMM-32 |

grev is encoded as a standard R-type opcode and grevi is encoded as a standard I-type opcode.

## 2.4 Shift Ones (Left/Right) (slo, sloi, sro, sroi)

These instructions are similar to the shift-logical operations from the base spec, except that they shift in ones instead of zeros. This can be used in mask creation or bit-field insertions, for example. In every other respect they behave exactly like the equivalent logical shift operations.

```
uint_xlen_t slo(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN-1);
    return ~(~rs1 << shamt);
}

uint_xlen_t sro(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN-1);
    return ~(~rs1 >> shamt);
}
```

| 31:25   | 24:20 | 19:15 | 14:12  | 11:7 | 6:0    |
|---------|-------|-------|--------|------|--------|
| funct7  | rs2   | rs1   | funct3 | rd   | opcode |
| 7       | 5     | 5     | 3      | 5    | 7      |
| 10????? | src2  | src1  | SRO    | dest | OP     |
| 10????? | src2  | src1  | SLO    | dest | OP     |
| 10????? | src2  | src1  | SROW   | dest | OP-32  |
| 10????? | src2  | src1  | SLOW   | dest | OP-32  |

| 31:27     | 26:20    | 19:15 | 14:12  | 11:7 | 6:0    |
|-----------|----------|-------|--------|------|--------|
| imm[11:7] | imm[6:0] | rs1   | funct3 | rd   | opcode |
| 5         | 7        | 5     | 3      | 5    | 7      |
| 10???     | shamt    | src   | SLOI   | dest | OP-IMM |
| 10???     | shamt    | src   | SROI   | dest | OP-IMM |

| 31:25     | 24:20    | 19:15 | 14:12  | 11:7 | 6:0       |
|-----------|----------|-------|--------|------|-----------|
| imm[11:5] | imm[4:0] | rs1   | funct3 | rd   | opcode    |
| 7         | 5        | 5     | 3      | 5    | 7         |
| 10?????   | shamt    | src   | SLOIW  | dest | OP-IMM-32 |
| 10?????   | shamt    | src   | SROIW  | dest | OP-IMM-32 |

s(l/r)o(i) is encoded similarly to the logical shifts in the base spec. However, the spec of the entire family of instructions is changed so that the high bit of the instruction indicates the value to be inserted during a shift. This means that a sloi instruction can be encoded similarly to an slli instruction, but with a 1 in the highest bit of the encoded instruction. This encoding is backwards compatible with the definition for the shifts in the base spec, but allows for simple addition of a ones-insert.

When implementing this circuit, the only change in the ALU over a standard logical shift is that the value shifted in is not zero, but is a 1-bit register value that has been forwarded from the high bit of the instruction decode. This creates the desired behavior on both logical zero-shifts and logical ones-shifts.
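The mask-creation use case mentioned in this section can be sketched as follows, assuming an RV32 target (`XLEN` fixed to 32 and `uint_xlen_t` mapped to `uint32_t`). The helpers `low_mask` and `field_mask` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: an RV32 target. */
#define XLEN 32
typedef uint32_t uint_xlen_t;

/* Shift left, shifting in ones. */
uint_xlen_t slo(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN-1);
    return ~(~rs1 << shamt);
}

/* Shift right, shifting in ones. */
uint_xlen_t sro(uint_xlen_t rs1, uint_xlen_t rs2)
{
    int shamt = rs2 & (XLEN-1);
    return ~(~rs1 >> shamt);
}

/* Hypothetical helper: mask with the low `len` bits set, 0 <= len < XLEN.
 * Shifting ones into a zero register yields the mask in one operation. */
uint_xlen_t low_mask(int len)
{
    assert(len >= 0 && len < XLEN);
    return slo(0, (uint_xlen_t)len);
}

/* Hypothetical helper: mask covering bits [pos, pos+len). */
uint_xlen_t field_mask(int pos, int len)
{
    assert(pos >= 0 && pos < XLEN && pos + len <= XLEN);
    return low_mask(len) << pos;
}
```

With the base ISA alone, `low_mask` would typically need a load-immediate, a shift, and an add or invert; with slo it is a single instruction applied to x0.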

## 2.5 Rotate (Left/Right) (rol, ror, rori)

These instructions are similar to shift-logical operations from the base spec, except they shift in the values from the opposite side of the register, in order. This is also called 'circular shift'.

```
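uint_xlen_t rol(uint_xlen_t rs1, uint_xlen_t rs2)
{
   int shamt = rs2 & (XLEN-1);
   /* The second shift amount is masked so that shamt == 0 does not
      shift by XLEN, which would be undefined behavior in C. */
   return (rs1 << shamt) | (rs1 >> ((XLEN-shamt) & (XLEN-1)));
}

uint_xlen_t ror(uint_xlen_t rs1, uint_xlen_t rs2)
{
   int shamt = rs2 & (XLEN-1);
   return (rs1 >> shamt) | (rs1 << ((XLEN-shamt) & (XLEN-1)));
}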
```

| 31:25   | 24:20 | 19:15 | 14:12  | 11:7 | 6:0    |
|---------|-------|-------|--------|------|--------|
| funct7  | rs2   | rs1   | funct3 | rd   | opcode |
| 7       | 5     | 5     | 3      | 5    | 7      |
| 11????? | src2  | src1  | ROR    | dest | OP     |
| 11????? | src2  | src1  | ROL    | dest | OP     |
| 11????? | src2  | src1  | RORW   | dest | OP-32  |
| 11????? | src2  | src1  | ROLW   | dest | OP-32  |

| 31:27     | 26:20    | 19:15 | 14:12  | 11:7 | 6:0    |
|-----------|----------|-------|--------|------|--------|
| imm[11:7] | imm[6:0] | rs1   | funct3 | rd   | opcode |
| 5         | 7        | 5     | 3      | 5    | 7      |
| 11???     | shamt    | src   | RORI   | dest | OP-IMM |

| 31:25     | 24:20    | 19:15 | 14:12  | 11:7 | 6:0       |
|-----------|----------|-------|--------|------|-----------|
| imm[11:5] | imm[4:0] | rs1   | funct3 | rd   | opcode    |
| 7         | 5        | 5     | 3      | 5    | 7         |
| 11?????   | shamt    | src   | RORIW  | dest | OP-IMM-32 |

Rotate is implemented very similarly to the other shift instructions. One possible way to encode it is to re-use the way that bit 30 in the instruction encoding selects 'arithmetic shift' when bit 31 is zero (signalling a logical-zero shift). We can re-use this so that when bit 31 is set (signalling a logical-ones shift), if bit 30 is also set, then we are doing a rotate. The following table summarizes the behavior. The generalized reverse opcodes can be encoded using the bit pattern that would otherwise encode an "Arithmetic Left Shift" (which is an operation that does not exist).

| Bit 31 | Bit 30 | Meaning             |
|--------|--------|---------------------|
| 0      | 0      | Logical Shift-Zeros |
| 0      | 1      | Arithmetic Shift    |
| 1      | 0      | Logical Shift-Ones  |
| 1      | 1      | Rotate              |

Table 2.1: Rotate Encodings
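Table 2.1 can be read as a two-bit selector for the value that fills the vacated bit positions. The following is a rough sketch of that idea for 32-bit right shifts only. The enum and function names are hypothetical, and the left-shift direction (including the re-use of the "Arithmetic Left Shift" slot for generalized reverse) is not modeled.

```c
#include <stdint.h>

/* Hypothetical names for the four rows of Table 2.1.
 * The numeric value is (bit 31 << 1) | bit 30 of the instruction word. */
enum shift_kind {
    SHIFT_ZEROS  = 0,  /* bit 31 = 0, bit 30 = 0 */
    SHIFT_ARITH  = 1,  /* bit 31 = 0, bit 30 = 1 */
    SHIFT_ONES   = 2,  /* bit 31 = 1, bit 30 = 0 */
    SHIFT_ROTATE = 3   /* bit 31 = 1, bit 30 = 1 */
};

/* Extract bits 31 and 30 of a 32-bit instruction word. */
enum shift_kind decode_shift_kind(uint32_t insn)
{
    return (enum shift_kind)(insn >> 30);
}

/* Right shift of a 32-bit value for each kind of fill. */
uint32_t shift_right(enum shift_kind kind, uint32_t rs1, uint32_t rs2)
{
    int shamt = rs2 & 31;
    uint32_t zeros = rs1 >> shamt;
    /* The high `shamt` bit positions, which the shift vacates. */
    uint32_t vacated = ~(~(uint32_t)0 >> shamt);

    switch (kind) {
    case SHIFT_ZEROS:  return zeros;
    case SHIFT_ARITH:  return (rs1 >> 31) ? (zeros | vacated) : zeros;
    case SHIFT_ONES:   return zeros | vacated;
    case SHIFT_ROTATE: return zeros | (shamt ? rs1 << (32 - shamt) : 0);
    }
    return 0;
}
```

All four variants share the same zero-filled shift result and differ only in what is OR-ed into the vacated positions, which mirrors the small hardware delta described above.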

## 2.6 And-with-complement (andc)

This instruction implements the and-with-complement operation.

```
uint_xlen_t andc(uint_xlen_t rs1, uint_xlen_t rs2)
{
    return rs1 & ~rs2;
}
```

Other with-complement operations (orc, nand, nor, etc.) can be implemented by combining not (c.not) with the base ALU operation (which fits in 32 bits when using two compressed instructions). Only and-with-complement occurs frequently enough to warrant a dedicated instruction.

| 31:25   | 24:20 | 19:15 | 14:12  | 11:7 | 6:0    |
|---------|-------|-------|--------|------|--------|
| funct7  | rs2   | rs1   | funct3 | rd   | opcode |
| 7       | 5     | 5     | 3      | 5    | 7      |
| ??????? | src2  | src1  | ANDC   | dest | OP     |
| ??????? | src2  | src1  | ANDCW  | dest | OP-32  |

## 2.7 Bit Extract/Deposit (bext, bdep)

These instructions implement the generic bit extract and bit deposit functions. These operations are also referred to as bit gather/scatter, bit pack/unpack, parallel extract/deposit, compress/expand, or right\_compress/right\_expand.

BEXT[W] rd,rs1,rs2 collects LSB justified bits to rd from rs1 using extract mask in rs2.

BDEP[W] rd,rs1,rs2 writes LSB justified bits from rs1 to rd using deposit mask in rs2.

```
uint_xlen_t bext(uint_xlen_t v, uint_xlen_t mask)
{
    uint_xlen_t c = 0, m = 1;
    while (mask) {
        uint_xlen_t b = mask & -mask;  /* lowest set bit of mask */
        if (v & b)
            c |= m;
        mask -= b;
        m <<= 1;
    }
    return c;
}

uint_xlen_t bdep(uint_xlen_t v, uint_xlen_t mask)
{
    uint_xlen_t c = 0, m = 1;
    while (mask) {
        uint_xlen_t b = mask & -mask;  /* lowest set bit of mask */
        if (v & m)
            c |= b;
        mask -= b;
        m <<= 1;
    }
    return c;
}
```

Implementations might choose to use smaller multi-cycle implementations of bext and bdep. Even though multi-cycle bext and bdep often are not fast enough to outperform algorithms that use sequences of shifts and bit masks, dedicated instructions for those operations can still be of great advantage, especially in cases where the mask argument is not constant.

For example, the following code efficiently calculates the index of the tenth set bit in a0 using bdep:

```
li a1, 0x00000200
bdep a0, a1, a0
brev a0, a0
clz a0, a0
```
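The assembler sequence above can be modeled in C as a sketch, assuming an RV32 target. The helper name `nth_set_bit` is hypothetical. The sequence of brev followed by clz is equivalent to ctz, so this model calls ctz directly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: an RV32 target. */
#define XLEN 32
typedef uint32_t uint_xlen_t;

/* Reference bdep: scatter the low bits of v to the set positions of mask. */
uint_xlen_t bdep(uint_xlen_t v, uint_xlen_t mask)
{
    uint_xlen_t c = 0, m = 1;
    while (mask) {
        uint_xlen_t b = mask & -mask;  /* lowest set bit of mask */
        if (v & m)
            c |= b;
        mask -= b;
        m <<= 1;
    }
    return c;
}

/* Reference ctz. */
uint_xlen_t ctz(uint_xlen_t rs1)
{
    for (int count = 0; count < XLEN; count++)
        if ((rs1 >> count) & 1)
            return count;
    return XLEN;
}

/* Hypothetical helper: bit index of the n-th set bit of x, where n = 0
 * selects the lowest set bit. Returns XLEN if x has fewer than n+1 set
 * bits, because bdep then produces 0. */
uint_xlen_t nth_set_bit(uint_xlen_t x, int n)
{
    assert(n >= 0 && n < XLEN);
    return ctz(bdep((uint_xlen_t)1 << n, x));
}
```

The constant `0x00000200` in the assembler example corresponds to `n = 9` here, i.e. the tenth set bit.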

For cases with a constant mask an optimizing compiler would decide when to use bext or bdep based on the optimization profile for the concrete processor it is optimizing for. This is similar to the decision whether to use MUL or DIV with a constant, or to perform the same operation using a longer sequence of much simpler operations.

| 31:25   | 24:20 | 19:15 | 14:12  | 11:7 | 6:0    |
|---------|-------|-------|--------|------|--------|
| funct7  | rs2   | rs1   | funct3 | rd   | opcode |
| 7       | 5     | 5     | 3      | 5    | 7      |
| ??????? | src2  | src1  | BEXT   | dest | OP     |
| ??????? | src2  | src1  | BDEP   | dest | OP     |
| ??????? | src2  | src1  | BEXTW  | dest | OP-32  |
| ??????? | src2  | src1  | BDEPW  | dest | OP-32  |

### 2.8 Generalized Bit Permutations (shuffle)

This instruction performs a bit permutation on the value in rs1. Which bit permutation is performed is defined by the control word in rs2 (Table 2.2).

| Variant | mask                        | mode  | command |
|---------|-----------------------------|-------|---------|
| RV128   | 79:16                       | 15:12 | 11:0    |
| RV64    | 47:16 (bits 63:48 unused)   | 15:12 | 11:0    |
| RV32    | 31:16                       | 15:12 | 11:0    |

Table 2.2: shuffle control word

This spec only defines command = 0. Non-zero command values are reserved for future use. An implementation that does not support a given command must return 0 in rd. Support for command 0 is mandatory. Command values 1-7 are reserved for non-standard extensions (NSE).

Command 0 implements functions that are required for computing zip and unzip operations, butterfly stages or entire butterfly networks, omega stages or networks, flip stages or networks, and similar operations. Table 2.3 lists the operations performed by command 0. Note that these functions can all use the existing butterfly network that implements the GREV instructions.

Modes that are reserved for future standard or non-standard extensions must return 0 in rd on implementations that do not support those future extensions.

shuffle with rs2=0 (x0) implements just a zip operation (butterfly is disabled because mask=0).

The zip (aka "shuffle") operation interleaves the bits of the lower and upper half of its argument. The unzip (aka "unshuffle") operation performs the inverse.

In other words, zip performs a rotate left shift on the bit indices, and unzip performs a rotate right shift on the bit indices. Performing zip $\log_2(\mathrm{XLEN})$ times is the identity. Performing it $\log_2(\mathrm{XLEN}) - 1$ times is equivalent to one execution of unzip.

```
uint_xlen_t zip(uint_xlen_t rs1)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN/2; i++) {
        x |= ((rs1 >> i) & 1) << (2*i);
        x |= ((rs1 >> (i+XLEN/2)) & 1) << (2*i+1);
    }
    return x;
}

uint_xlen_t unzip(uint_xlen_t rs1)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN/2; i++) {
        x |= ((rs1 >> (2*i)) & 1) << i;
        x |= ((rs1 >> (2*i+1)) & 1) << (i+XLEN/2);
    }
    return x;
}
```

| Mode | Description                             | Pseudo instructions |
|------|-----------------------------------------|---------------------|
| 0000 | zip + butterfly(mask, 0)                | omega, zip          |
| 0001 | butterfly(mask, 0) + unzip              | flip, unzip         |
| 0010 | reserved for future standard extensions | _                   |
| 0011 | reserved for future standard extensions | _                   |
| 0100 | reserved for future standard extensions | _                   |
| 0101 | reserved for future standard extensions | _                   |
| 0110 | reserved for future standard extensions | _                   |
| 0111 | reserved for future standard extensions | _                   |
| 1000 | butterfly(mask, 0)                      | bfly                |
| 1001 | butterfly(mask, 1)                      | bfly                |
| 1010 | butterfly(mask, 2)                      | bfly                |
| 1011 | butterfly(mask, 3)                      | bfly                |
| 1100 | butterfly(mask, 4)                      | bfly                |
| 1101 | butterfly(mask, 5) (RV64/128; NSE)      | bfly                |
| 1110 | butterfly(mask, 6) (RV128; NSE)         | bfly                |
| 1111 | reserved for future standard extensions | _                   |

Table 2.3: Modes for shuffle command 0
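The index-rotation view of zip and unzip can be checked with a small sketch, assuming an RV32 target so that $\log_2(\mathrm{XLEN}) = 5$. The macro `LOG2_XLEN` value and the helper `zip_n` are assumptions introduced for this example.

```c
#include <stdint.h>

/* Assumptions for this sketch: an RV32 target. */
#define XLEN 32
#define LOG2_XLEN 5
typedef uint32_t uint_xlen_t;

/* Interleave the bits of the lower and upper half of rs1. */
uint_xlen_t zip(uint_xlen_t rs1)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN/2; i++) {
        x |= ((rs1 >> i) & 1) << (2*i);
        x |= ((rs1 >> (i+XLEN/2)) & 1) << (2*i+1);
    }
    return x;
}

/* Inverse of zip: even bits go to the lower half, odd bits to the upper. */
uint_xlen_t unzip(uint_xlen_t rs1)
{
    uint_xlen_t x = 0;
    for (int i = 0; i < XLEN/2; i++) {
        x |= ((rs1 >> (2*i)) & 1) << i;
        x |= ((rs1 >> (2*i+1)) & 1) << (i+XLEN/2);
    }
    return x;
}

/* Hypothetical helper: apply zip n times in a row. */
uint_xlen_t zip_n(uint_xlen_t x, int n)
{
    while (n-- > 0)
        x = zip(x);
    return x;
}
```

Since each zip rotates the 5-bit index of every bit left by one position, `zip_n(x, LOG2_XLEN)` returns `x` and `zip_n(x, LOG2_XLEN-1)` equals `unzip(x)` for any input.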

The butterfly operation performs a single butterfly stage N (i.e. the grev operation with argument  $2^N$ ), which performs XLEN/2 pairwise bit swaps. But unlike grev the individual bit swaps are conditional and the XLEN/2 bits in mask determine which bit swaps are taken.

```
uint_xlen_t swapbits(uint_xlen_t x, int p, int q)
{
    assert(p < q);
    /* The constant 1 is widened so the shift is valid for XLEN > 32. */
    x = x ^ ((x & ((uint_xlen_t)1 << p)) << (q-p));
    x = x ^ ((x & ((uint_xlen_t)1 << q)) >> (q-p));
    x = x ^ ((x & ((uint_xlen_t)1 << p)) << (q-p));
    return x;
}
```

```
uint_xlen_t butterfly(uint_xlen_t x, uint_xlen_t mask, int N)
{
    int a = 1 << N, b = 2*a;
    for (int i = 0; i < XLEN/2; i++) {
        int p = b*(i/a) + i%a, q = p + a;
        if ((mask >> i) & 1)
            x = swapbits(x, p, q);
    }
    return x;
}
```

Putting it all together:

```
uint_xlen_t shuffle(uint_xlen_t x, uint_xlen_t ctrl)
{
    uint_xlen_t mask = ctrl >> 16;
    int mode = (ctrl >> 12) & 15;
    int cmd = ctrl & 0xfff;
    if (cmd != 0 || mode > 7+LOG2_XLEN)
        return 0;
    if (mode == 0)
        return butterfly(zip(x), mask, 0);
    if (mode == 1)
        return unzip(butterfly(x, mask, 0));
    if (mode > 7)
        return butterfly(x, mask, mode & 7);
    return 0;
}
```

On RV32, a control word for command 0 can be loaded using a single lui instruction. At most $2 \cdot \log_2(\mathrm{XLEN}) - 1$ shuffle operations are required to perform an arbitrary bit permutation (9 on RV32, 11 on RV64). Most bit permutations that arise from real-world applications can be implemented in shorter sequences.

Commands in the range 1-2047 with the upper bits (mask/mode) set to zero can be loaded with a single li instruction. Note that there is no requirement for future non-zero commands to perform bit-permutations or even reversible operations.

| 31:25   | 24:20 | 19:15 | 14:12    | 11:7 | 6:0    |
|---------|-------|-------|----------|------|--------|
| funct7  | rs2   | rs1   | funct3   | rd   | opcode |
| 7       | 5     | 5     | 3        | 5    | 7      |
| ??????? | src2  | src1  | SHUFFLE  | dest | OP     |
| ??????? | src2  | src1  | SHUFFLEW | dest | OP-32  |

### 2.9 Compressed instructions (c.not, c.neg, c.brev)

The RISC-V ISA has no dedicated instructions for bitwise inverse (not) and arithmetic inverse (neg). Instead not is implemented as xori rd, rs, -1 and neg is implemented as sub rd, x0, rs.

In bit manipulation code **not** and **neg** are very common operations. But there are no compressed encodings for those operations because there is no **c.xori** instruction and **c.sub** cannot operate on **x0**.

Many bit manipulation operations that have dedicated opcodes in other ISAs must be constructed from smaller atoms in RISC-V XBitmanip code. But implementations might choose to implement them in a single micro-op using macro-op fusion. For this it is helpful when the fused sequences are short. **not** and **neg** are good candidates for macro-op fusion, so it is helpful to have compressed opcodes for them.

Likewise brev (an alias for grevi rd, rs, -1, i.e. bitwise reversal) is also a very common atom for building bit manipulation operations. So it is helpful to have a compressed opcode for this instruction as well.
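As an illustration only, the three operations can be modelled in C for RV32 as follows. The function names are hypothetical and not part of the spec; the bit reversal is written as a plain loop rather than via grev, on the assumption that brev maps bit i to bit 31-i.

```c
#include <assert.h>
#include <stdint.h>

/* not: same result as xori rd, rs, -1 */
uint32_t model_not(uint32_t rs) { return rs ^ UINT32_C(0xFFFFFFFF); }

/* neg: same result as sub rd, x0, rs (arithmetic modulo 2^32) */
uint32_t model_neg(uint32_t rs) { return UINT32_C(0) - rs; }

/* brev: bit i of the input becomes bit 31-i of the output */
uint32_t model_brev(uint32_t rs)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        r |= ((rs >> i) & 1u) << (31 - i);
    return r;
}
```

Applying model_brev twice returns the original value, which is a quick sanity check for any implementation.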

The compressed instructions c.not, c.neg, c.brev must be supported by all implementations that support the C extension and XBitmanip.

| 15:13 | 12        | 11:10 | 9:7      | 6:2                  | 1:0 | Instruction                      |
|-------|-----------|-------|----------|----------------------|-----|----------------------------------|
| 011   | nzimm[9]  | rd = 2 (bits 11:7) |  | nzimm[4\|6\|8:7\|5]  | 01  | C.ADDI16SP (RES, nzimm=0)        |
| 011   | nzimm[17] | rd $\neq$ {0, 2} (bits 11:7) |  | nzimm[16:12] | 01  | C.LUI (RES, nzimm=0; HINT, rd=0) |
| 011   | 0         | 00    | rs2'/rd' | 0                    | 01  | C.NEG                            |
| 011   | 0         | 01    | rs1'/rd' | 0                    | 01  | C.NOT                            |
| 011   | 0         | 10    | rs1'/rd' | 0                    | 01  | C.BREV                           |
| 011   | 0         | 11    | _        | 0                    | 01  | Reserved                         |

These three instructions fit nicely in the reserved space in C.LUI/C.ADDI16SP. They only occupy 0.1% of the $\approx 15.6$ bit wide RVC encoding space.

### 2.10 Pseudo instructions and macros

RISC-V is a RISC instruction set. Unless there is a good argument against it, we try not to assign dedicated opcodes for complex operations that are just easily macro-op fusable short sequences of already existing instructions, especially if compressed instructions can be used to keep the length of those sequences reasonably short.

The assembler should provide pseudo-instructions for some of the sequences that are implemented using dedicated instructions in some CISC bit-manipulation instruction sets. The sections describing sequences where the assembler should provide pseudo-instructions are titled "pseudo instruction", whereas sections that just informally describe useful sequences are titled "macro".

Many of the code snippets below can utilize compressed instructions. But for simplicity we use uncompressed instructions in the assembler listings. Some of the macros spill to temporary registers. t0, t1, t2, ... are used for spilling in the assembler listings. The input registers are referred to as rs1, rs2, ... and the output register is rd.

#### 2.10.1 MIX/MUX Macros

#### **MIX Operation**

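A MIX operation selects bits from rs1 and rs2 based on the bits in the control word rs3: each result bit is taken from rs1 where the corresponding rs3 bit is 1, and from rs2 where it is 0.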

```
and t0, rs1, rs3
andc t1, rs2, rs3
or rd, t0, t1
```

#### **MUX Operation**

A MUX operation selects the whole word rs1 if the control word rs3 is nonzero, or rs2 if rs3 is zero, without branching.

```
snez t0, rs3
neg t0, t0
and t1, rs1, t0
andc t2, rs2, t0
or rd, t1, t2
```

Or when rs3 is already either 0 or 1:

```
neg t0, rs3
and t1, rs1, t0
andc t2, rs2, t0
or rd, t1, t2
```
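The MIX and MUX sequences above can be sketched in C for RV32 as follows. The function names are illustrative; each statement is annotated with the assembler instruction it is meant to mirror.

```c
#include <assert.h>
#include <stdint.h>

/* MIX: per-bit select, rs1 where rs3 is 1, rs2 where rs3 is 0 */
uint32_t mix32(uint32_t rs1, uint32_t rs2, uint32_t rs3)
{
    uint32_t t0 = rs1 & rs3;    /* and  t0, rs1, rs3 */
    uint32_t t1 = rs2 & ~rs3;   /* andc t1, rs2, rs3 */
    return t0 | t1;             /* or   rd, t0, t1   */
}

/* MUX: whole-word select, rs1 if rs3 != 0, else rs2 */
uint32_t mux32(uint32_t rs1, uint32_t rs2, uint32_t rs3)
{
    uint32_t t0 = (rs3 != 0);   /* snez t0, rs3                    */
    t0 = UINT32_C(0) - t0;      /* neg  t0, t0  (0 or all ones)    */
    uint32_t t1 = rs1 & t0;     /* and  t1, rs1, t0                */
    uint32_t t2 = rs2 & ~t0;    /* andc t2, rs2, t0                */
    return t1 | t2;             /* or   rd, t1, t2                 */
}
```

The MUX is just a MIX whose control word has been widened to all zeros or all ones, which is why andc appears in both patterns.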

#### 2.10.2 Bit-field extract and deposit

#### Pseudo instruction bfext

Extract the contiguous bit field starting at pos with length len from rs:
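One way to realize such an extraction is a left shift followed by a logical right shift, in the same spirit as the slli/srli pair listed for bextr (imm) in Table 4.1. The following C model for RV32 is a sketch under that assumption; the name bfext32 and the precondition are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical RV32 model of a bit-field extract.
   Requires len >= 1 and pos + len <= 32 so that both shift amounts stay in 0..31. */
uint32_t bfext32(uint32_t rs, unsigned pos, unsigned len)
{
    assert(len >= 1 && pos + len <= 32);
    uint32_t t = rs << (32 - pos - len);  /* like slli: discard the bits above the field */
    return t >> (32 - len);               /* like srli: move the field down to bit 0     */
}
```

For example, bfext32(0xABCD1234, 8, 8) yields 0x12, the second-lowest byte.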

#### Macros for bit-field deposit

Deposit len bits from rs2 at pos in rd, remaining bits in rd are filled from rs1.

Assuming rs1[pos+len-1:pos]=0 and rs2[XLEN-1:pos]=0:

```
slli t0, rs2, pos
or rd, rs1, t0
```

Otherwise masking and/or shift operations should be used to clear the extra bits in rs1 and rs2 first. On a machine with fast bdep, the bdep instruction can be used to shift and mask rs2 in one instruction, using the same mask that is also used to mask rs1:

```
li t0, ((1 << len)-1) << pos
andc t1, rs1, t0
bdep t0, rs2, t0
or rd, t0, t1
```
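The masked deposit can be modelled in C as shown below (RV32, illustrative names). The sketch relies on the observation that bdep with a contiguous mask starting at pos places bit j of rs2 at bit pos+j, which is the same as shifting rs2 left by pos and masking; len is restricted to less than 32 here only to keep the mask computation free of out-of-range shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical RV32 model: deposit the low len bits of rs2 at bit pos,
   taking all remaining bits from rs1. */
uint32_t bfdep32(uint32_t rs1, uint32_t rs2, unsigned pos, unsigned len)
{
    assert(len >= 1 && len < 32 && pos + len <= 32);
    uint32_t mask  = ((UINT32_C(1) << len) - 1) << pos;  /* li   t0, ((1 << len)-1) << pos    */
    uint32_t keep  = rs1 & ~mask;                        /* andc t1, rs1, t0                  */
    uint32_t field = (rs2 << pos) & mask;                /* bdep t0, rs2, t0 (contiguous mask) */
    return field | keep;                                 /* or   rd, t0, t1                   */
}
```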

#### 2.10.3 Pseudo instructions using grevi

On RV32:

```
brev    rd, rs  ->  grevi rd, rs, 31   ; bitwise reverse
brev.h  rd, rs  ->  grevi rd, rs, 15   ; reverse bits in each 16 bit half-word
brev.b  rd, rs  ->  grevi rd, rs, 7    ; reverse bits in each 8 bit byte
bswap   rd, rs  ->  grevi rd, rs, 24   ; reverse the byte order
bswap.h rd, rs  ->  grevi rd, rs, 8    ; swap bytes in each 16 bit half-word
hswap   rd, rs  ->  grevi rd, rs, 16   ; swap the two 16 bit half-words
```

On RV64:

```
brev    rd, rs  ->  grevi rd, rs, 63   ; bitwise reverse
brev.w  rd, rs  ->  grevi rd, rs, 31   ; reverse bits in each 32 bit word
brev.h  rd, rs  ->  grevi rd, rs, 15   ; reverse bits in each 16 bit half-word
brev.b  rd, rs  ->  grevi rd, rs, 7    ; reverse bits in each 8 bit byte
bswap   rd, rs  ->  grevi rd, rs, 56   ; reverse the byte order
bswap.w rd, rs  ->  grevi rd, rs, 24   ; reverse byte order in each 32 bit word
bswap.h rd, rs  ->  grevi rd, rs, 8    ; swap bytes in each 16 bit half-word
hswap   rd, rs  ->  grevi rd, rs, 48   ; reverse order of 16 bit half-words
hswap.w rd, rs  ->  grevi rd, rs, 16   ; swap 16 bit half-words in each 32 bit word
wswap   rd, rs  ->  grevi rd, rs, 32   ; swap the two 32 bit words
```
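To see why these shift amounts produce the named permutations, the following C sketch models a 32-bit grev in the usual staged form (one conditional swap stage per bit of the shift amount). It is an illustration written for this section, assumed to agree with the grev definition in section 2.3; the name grev32 is not taken from the spec.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a 32-bit generalized reverse: bit j of k enables the stage
   that swaps adjacent blocks of 2^j bits. */
uint32_t grev32(uint32_t x, unsigned k)
{
    if (k &  1) x = ((x & 0x55555555u) <<  1) | ((x & 0xAAAAAAAAu) >>  1);
    if (k &  2) x = ((x & 0x33333333u) <<  2) | ((x & 0xCCCCCCCCu) >>  2);
    if (k &  4) x = ((x & 0x0F0F0F0Fu) <<  4) | ((x & 0xF0F0F0F0u) >>  4);
    if (k &  8) x = ((x & 0x00FF00FFu) <<  8) | ((x & 0xFF00FF00u) >>  8);
    if (k & 16) x = ((x & 0x0000FFFFu) << 16) | ((x & 0xFFFF0000u) >> 16);
    return x;
}
```

With this model, bswap (24 = 8 + 16) enables only the byte and half-word stages, while brev.b (7 = 1 + 2 + 4) enables only the stages that act inside a byte.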

#### 2.10.4 Pseudo instructions using shuffle

#### Pseudo instruction bfly

shuffle with mode[3]=1 (and mode[2:0] $< \log_2(\mathrm{XLEN})$) performs a butterfly operation. The assembler should provide a bfly pseudo-instruction for rd $\neq$ rs and constant mask and N (if N is omitted then N=0 is assumed):

(On RV64, longer sequences are required instead of lui to load the full 32 bit mask into rd.)
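A sketch of how an assembler might compute the lui immediate on RV32 follows. It assumes the control word layout used by the shuffle reference code (mask in bits 31:16, mode in bits 15:12, command in bits 11:0) and the butterfly modes 8+N from Table 2.3. The helper names are hypothetical. Under this reading, bfly rd, rs, mask, N would expand to lui rd, imm followed by shuffle rd, rs, rd.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 20-bit lui immediate that puts `mask` into bits 31:16
   and `mode` into bits 15:12 of the control word; command bits 11:0 stay zero. */
uint32_t shuffle_lui_imm(uint32_t mask, unsigned mode)
{
    assert(mask <= 0xFFFFu && mode <= 15u);
    return (mask << 4) | mode;
}

/* Hypothetical helper for bfly on RV32: mode = 8 + N with N < log2(32) = 5. */
uint32_t bfly_lui_imm(uint32_t mask, unsigned n)
{
    assert(n < 5);
    return shuffle_lui_imm(mask, 8u + n);
}
```

As a consistency check, the decode_b routine in section 4.3 uses lui t0, 0x804eb, which under this layout corresponds to mask 0x804e and N=3.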

For example, an arbitrary RV32 bit permutation using a complete butterfly network:

```
bfly a1, a0, <maskA>, 4
bfly a2, a1, <maskB>, 3
bfly a0, a2, <maskC>, 2
bfly a1, a0, <maskD>, 1
bfly a2, a1, <maskE>, 0
bfly a0, a2, <maskF>, 1
bfly a1, a0, <maskG>, 2
bfly a2, a1, <maskH>, 3
bfly a0, a2, <maskI>, 4
```

Permutations arising from real-world applications can often be implemented using shorter sequences.

#### Pseudo instructions omega and flip

A zip operation followed by a butterfly (0) is commonly known as an *omega stage*.

An unzip operation followed by butterfly($\log_2(\mathrm{XLEN}) - 1$) (or equivalently, butterfly(0) followed by unzip) is commonly known as a *flip stage*. The assembler should provide appropriate pseudo-instructions for rd $\neq$ rs and constant mask:

```
omega rd, rs, mask -> lui rd, (mask << 4)
                      shuffle rd, rs, rd
flip rd, rs, mask  -> lui rd, (mask << 4) | 1
                      shuffle rd, rs, rd
```

For example, an arbitrary RV32 bit permutation using a complete omega-flip network:

```
omega a1, a0, <maskA>
omega a0, a1, <maskB>
omega a1, a0, <maskC>
omega a0, a1, <maskD>
omega a1, a0, <maskE>
flip a0, a1, <maskF>
flip a1, a0, <maskG>
flip a0, a1, <maskH>
flip a1, a0, <maskI>
flip a0, a1, <maskJ>
```

Or one instruction shorter by merging the center omega+flip pair into a butterfly operation:

```
omega a1, a0, <maskA>
omega a2, a1, <maskB>
omega a0, a2, <maskC>
omega a1, a0, <maskD>
bfly a2, a1, <maskE> ^ <maskF>, 4
flip a0, a2, <maskG>
flip a1, a0, <maskH>
flip a2, a1, <maskI>
flip a0, a2, <maskJ>
```
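The merge of the center omega+flip pair can be checked with a small C model for RV32. The functions below restate zip, unzip and butterfly for 32-bit words (the butterfly swap is written as a conditional flip of both bits, which is equivalent to swapping them); omega32 and flip32 follow modes 0 and 1 of the shuffle reference code. All names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

uint32_t zip32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++) {
        r |= ((x >> i) & 1u) << (2 * i);             /* lower half -> even bits */
        r |= ((x >> (i + 16)) & 1u) << (2 * i + 1);  /* upper half -> odd bits  */
    }
    return r;
}

uint32_t unzip32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++) {
        r |= ((x >> (2 * i)) & 1u) << i;             /* even bits -> lower half */
        r |= ((x >> (2 * i + 1)) & 1u) << (i + 16);  /* odd bits  -> upper half */
    }
    return r;
}

uint32_t butterfly32(uint32_t x, uint32_t mask, int n)
{
    int a = 1 << n, b = 2 * a;
    for (int i = 0; i < 16; i++) {
        int p = b * (i / a) + i % a, q = p + a;
        if ((mask >> i) & 1u) {
            uint32_t d = ((x >> p) ^ (x >> q)) & 1u;  /* 1 if bits p and q differ   */
            x ^= (d << p) | (d << q);                 /* flipping both swaps them   */
        }
    }
    return x;
}

/* omega: zip followed by butterfly stage 0 (shuffle mode 0) */
uint32_t omega32(uint32_t x, uint32_t mask) { return butterfly32(zip32(x), mask, 0); }

/* flip: butterfly stage 0 followed by unzip (shuffle mode 1) */
uint32_t flip32(uint32_t x, uint32_t mask) { return unzip32(butterfly32(x, mask, 0)); }
```

In this model, flip32(omega32(x, E), F) equals butterfly32(x, E ^ F, 4) for any x, because the zip and unzip cancel and the two conditional swaps of each pair (i, i+16) combine by XOR of their mask bits.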

As with butterfly networks, permutations arising from real-world applications can often be implemented using a much shorter sequence, especially if bfly, omega and flip can be mixed arbitrarily.

#### Pseudo instructions zip and unzip

With a zero value as control word, shuffle performs a simple zip operation. The assembler should provide a corresponding pseudo-instruction:

The **zip** instruction with the upper half of its input cleared performs the commonly needed "fanout" operation. (Equivalent to **bdep** with a 0x55555555 mask.) The **zip** instruction applied twice fans out the bits in the lower quarter of the input word by a spacing of 4 bits.
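The fanout property can be checked with a short C model for RV32. zip32 restates the zip reference code for 32-bit words; bdep32 is a sketch of the bit-deposit semantics (successive low bits of the first operand are placed at the set bit positions of the mask), assumed to match the definition in section 2.7. With a zero control word the shuffle reference code reduces to zip, so under this reading zip rd, rs could expand to shuffle rd, rs, zero.

```c
#include <assert.h>
#include <stdint.h>

uint32_t zip32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++) {
        r |= ((x >> i) & 1u) << (2 * i);
        r |= ((x >> (i + 16)) & 1u) << (2 * i + 1);
    }
    return r;
}

/* Sketch of bit deposit: bit j of rs1 goes to the position of the j-th set bit of rs2. */
uint32_t bdep32(uint32_t rs1, uint32_t rs2)
{
    uint32_t r = 0;
    for (int i = 0, j = 0; i < 32; i++)
        if ((rs2 >> i) & 1u) {
            if ((rs1 >> j) & 1u)
                r |= UINT32_C(1) << i;
            j++;
        }
    return r;
}
```

For inputs with the upper half cleared, one zip matches bdep with mask 0x55555555, and for inputs confined to the low byte, two zips match bdep with mask 0x11111111.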

For example, the following code calculates the bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32:

```
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
slli a1, a0, 4
add a0, a0, a1
slli a1, a0, 8
add a0, a0, a1
slli a1, a0, 16
add a0, a0, a1
```

The final prefix sum is stored in the 8 nibbles of the a0 output word.
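The same computation can be followed step by step in C. This is a sketch for RV32 with illustrative names; zip32 restates the zip reference code for 32-bit words.

```c
#include <assert.h>
#include <stdint.h>

uint32_t zip32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++) {
        r |= ((x >> i) & 1u) << (2 * i);
        r |= ((x >> (i + 16)) & 1u) << (2 * i + 1);
    }
    return r;
}

/* Nibble k of the result holds the number of set bits among bits 0..k of the low byte. */
uint32_t prefix_sum_nibbles(uint32_t a0)
{
    a0 &= 0xFFu;             /* andi a0, a0, 0xff                          */
    a0 = zip32(zip32(a0));   /* zip twice: bit i moves to bit 4*i          */
    a0 += a0 << 4;           /* each nibble now sums 2 neighbouring inputs */
    a0 += a0 << 8;           /* ... 4 inputs                               */
    a0 += a0 << 16;          /* ... all inputs at or below that nibble     */
    return a0;
}
```

Since at most eight bits are summed, no nibble exceeds 8 and the additions never carry into a neighbouring nibble. For the input 0xFF the result is 0x87654321.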

## Chapter 3

## Discussion

### 3.1 Frequently Asked Questions

#### grev seems to be overly complicated. Do we really need it?

The grev instruction can be used to build a wide range of common bit permutation instructions, such as endianness conversion or bit reversal.

If grev were removed from this spec we would need to add a few new instructions in its place for those operations.

#### Do we really need all the \*W opcodes for 32 bit ops on RV64?

I don't know. I think nobody does know at the moment. But they add very little complexity to the core. So the only question is if it is worth the encoding space. We need to run proper experiments with compilers that support those instructions. So they are in for now and if future evaluations show that they are not worth the encoding space then we can still throw them out.

#### Why only andc and not any other complement operators?

Early versions of this spec also included other \*c operators. But experiments<sup>1</sup> have shown that andc is much more common in bit manipulation code than any other operator, especially because it is commonly used in mix and mux operations.

#### Why andc? It can easily be emulated using and and not.

Yes, and we did not include any other ALU+complement operators. But andc is so common (mostly because of the mix and mux patterns), and its implementation is so cheap, that we decided to dedicate an R-type instruction to the operation.

<sup>1</sup> http://svn.clifford.at/handicraft/2017/bitcode/

#### The shift-ones instructions can be emulated using not and logical shift. Do we really need them?

Yes, a shift-ones instruction can easily be implemented using the logical shift instructions, with a bitwise invert before and after it. (This is literally the code we are using in the reference C implementation of the shift-ones instructions.)

We have decided to include them for now so that we can collect benchmark data before making a final decision on the inclusion or exclusion of those instructions.
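The emulation mentioned above amounts to the following C sketch for RV32 (illustrative names; the shift amount is reduced modulo 32 here as an assumption about how the register forms treat it).

```c
#include <assert.h>
#include <stdint.h>

/* shift left, filling with ones: invert, logical shift, invert */
uint32_t slo32(uint32_t rs1, unsigned shamt) { return ~(~rs1 << (shamt & 31u)); }

/* shift right, filling with ones: invert, logical shift, invert */
uint32_t sro32(uint32_t rs1, unsigned shamt) { return ~(~rs1 >> (shamt & 31u)); }
```

Applied to zero, slo32 produces a mask of shamt low ones, which is how slo a0, zero, a2 is used for mask generation in Table 4.1.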

#### BEXT/BDEP look like really expensive operations. Do we really need them?

Yes, they are expensive, but not as expensive as one might expect. A single-cycle 32 bit BEXT+BDEP+GREV core can be implemented in less space than a single-cycle 16x16 bit multiplier with 32 bit output.<sup>2</sup>

It is also important to keep in mind that implementing those operations in software is very expensive. Hacker's Delight contains a highly optimized software implementation of 32-bit BEXT that requires more than 120 instructions. Their BDEP software implementation requires more than 160 instructions. (Please disregard the "hardware-oriented algorithm" described in Hacker's Delight. It is extremely expensive compared to other implementations.<sup>3</sup>)

#### SHUFFLE looks like a really expensive operation. Do we really need it?

Even though this instruction looks expensive, it is actually quite simple to implement. The butterfly operation can just reuse the butterfly circuit that is already present to support the grev instruction, and zip and unzip are very cheap to implement (just one additional word-wide mux each). The "return 0" part for nonzero commands and reserved modes is also very cheap.

Considering that SHUFFLE is very cheap to implement on top of an existing GREV implementation, that it only requires a single R-type instruction, and that software emulation of similar functionality requires tens of instructions (and/or multiplications with large "magic constants"), it is a relatively good option.

With dedicated unary ZIP/UNZIP instructions it would be possible to emulate a single SHUFFLE instruction in under 10 instructions. For example, emulating a single OMEGA instruction (input in a0 and mask in a1):

```
grevi t0, a0, 0
zip a1, a1
and t0, t0, a1
andc a0, a0, a1
or a0, a0, t0
unzip a0, a0
```

<sup>2</sup> https://github.com/cliffordwolf/bextdep

<sup>3</sup> https://github.com/cliffordwolf/bextdep

Or emulating a single BFLY(2) instruction:

```
grevi t0, a0, 0
zip a1, a1
zip a1, a1
zip a1, a1
and t0, t0, a1
andc a0, a0, a1
or a0, a0, t0
```

This is not too bad, but considering that a single 32-bit permutation takes up to 9 of those, it is probably not a viable option for many bit permutations found in real-world applications.
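The idea behind such emulations (compute the unconditionally swapped word, then blend it with the original under a mask) can be modelled in C. The sketch below is for RV32 and makes two assumptions that are spelled out in the code: the swapped word comes from a single grev stage with shift amount $2^N$, and the blend mask covers both bits of every selected pair. All names are illustrative, and the pair mask is built with a plain loop rather than with zip.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: conditional pairwise swaps, as in the butterfly definition. */
uint32_t butterfly32(uint32_t x, uint32_t mask, int n)
{
    int a = 1 << n, b = 2 * a;
    for (int i = 0; i < 16; i++) {
        int p = b * (i / a) + i % a, q = p + a;
        if ((mask >> i) & 1u) {
            uint32_t d = ((x >> p) ^ (x >> q)) & 1u;
            x ^= (d << p) | (d << q);
        }
    }
    return x;
}

/* Assumption: one unconditional grev stage, i.e. grevi with shift amount 2^n. */
uint32_t grev_stage32(uint32_t x, int n)
{
    static const uint32_t lo[5] = {
        0x55555555u, 0x33333333u, 0x0F0F0F0Fu, 0x00FF00FFu, 0x0000FFFFu
    };
    int a = 1 << n;
    return ((x & lo[n]) << a) | ((x & ~lo[n]) >> a);
}

/* Assumption: the blend mask must select BOTH bit positions of each pair. */
uint32_t pair_mask32(uint32_t mask, int n)
{
    int a = 1 << n, b = 2 * a;
    uint32_t m = 0;
    for (int i = 0; i < 16; i++)
        if ((mask >> i) & 1u) {
            int p = b * (i / a) + i % a;
            m |= (UINT32_C(1) << p) | (UINT32_C(1) << (p + a));
        }
    return m;
}

/* Blend: swapped bits where the pair is selected, original bits elsewhere. */
uint32_t butterfly_blend32(uint32_t x, uint32_t mask, int n)
{
    uint32_t m = pair_mask32(mask, n);
    return (grev_stage32(x, n) & m) | (x & ~m);   /* and / andc / or */
}
```

The blend result agrees with the swap-based reference for every stage, which is the property any instruction-level emulation has to preserve.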

### 3.2 Analysis of used encoding space

So how much encoding space is used by the XBitmanip extension?

Table 3.1: XBitmanip encoding space (log2, i.e. in equivalent number of bits)

| RV32 count | RV32 bits | RV64 count | RV64 bits | Instructions                       |
|------------|-----------|------------|-----------|------------------------------------|
| 3x         | 10        | 6x         | 10        | CLZ, CLZW, CTZ, CTZW, PCNT, PCNTW  |
| 1x         | 15        | 3x         | 15        | GREV, GREVW, GREVIW                |
| 1x         | 15        | 1x         | 16        | GREVI                              |
| 2x         | 15        | 6x         | 15        | SLO, SRO, SLOW, SROW, SLOIW, SROIW |
| 2x         | 15        | 2x         | 16        | SLOI, SROI                         |
| 2x         | 15        | 5x         | 15        | ROR, ROL, RORW, ROLW, RORIW        |
| 1x         | 15        | 1x         | 16        | RORI                               |
| 4x         | 15        | 4x         | 15        | ANDC, BEXT, BDEP, SHUFFLE          |
|            |           | 4x         | 15        | ANDCW, BEXTW, BDEPW, SHUFFLEW      |
| 3x         | 4         | 3x         | 4         | C.NEG, C.NOT, C.BREV               |

The compressed encoding space is $\approx 15.6$ bits wide.

$$\log_2(3 \cdot 2^{14}) \approx 15.585$$

The compressed XBitmanip instructions need the equivalent of a 5.6 bit encoding space, or $\approx 0.1\%$ of the total $\approx 15.6$ bits available.

$$\log_2(3 \cdot 2^4) \approx 5.585$$

$$100/(2^{15.585-5.585}) \approx 0.098$$

On RV32, XBitmanip requires the equivalent of a  $\approx$  18.7 bit encoding space in the uncompressed encoding space. For comparison: A single standard I-type instruction (such as ADDI or SLTIU) requires a 22 bit encoding space. I.e. the entire RV32 XBitmanip extension needs less than one-eighth of the encoding space of the SLTIU instruction.

$$\log_2(3 \cdot 2^{10} + 13 \cdot 2^{15}) \approx 18.711$$

On RV64, XBitmanip requires the equivalent of a  $\approx$  19.9 bit encoding space in the uncompressed encoding space. I.e. the entire RV64 XBitmanip extension needs less than one-quarter of the encoding space of the SLTIU instruction.

$$\log_2(6 \cdot 2^{10} + 22 \cdot 2^{15} + 4 \cdot 2^{16}) \approx 19.916$$
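The two fraction claims can be checked directly from the instruction counts in Table 3.1, taking $2^{22}$ as the encoding space of one I-type instruction:

$$\frac{3 \cdot 2^{10} + 13 \cdot 2^{15}}{2^{22}} = \frac{429056}{4194304} \approx 0.102 < \frac{1}{8}$$

$$\frac{6 \cdot 2^{10} + 22 \cdot 2^{15} + 4 \cdot 2^{16}}{2^{22}} = \frac{989184}{4194304} \approx 0.236 < \frac{1}{4}$$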

## Chapter 4

## Evaluation, Algorithms

This chapter contains a collection of short code snippets and algorithms using the XBitmanip extension for evaluation purposes. For the sake of simplicity we assume RV32 for most examples in this chapter.

Most assembler routines in this chapter are written as if they were ABI functions, i.e. arguments are passed in a0, a1, ... and results are returned in a0. Registers t0, t1, ... are used for spilling.

Some of the assembler routines below cannot or should not overwrite their first argument. In those cases the arguments are passed in a1, a2, ... and results are returned in a0.

### 4.1 Emulating x86 Bit Manipulation ISAs

The following code snippets implement all instructions from the x86 bit manipulation ISA extensions ABM, BMI1, BMI2, and TBM using RISC-V code that does not spill any registers and thus could easily be implemented in a single instruction using macro-op fusion. (Some of them simply map directly to instructions in this spec and so no macro-op fusion is needed.) Note that shorter RISC-V code sequences are possible if we allow spilling to temporary registers.

Table 4.1: Emulating other Bit Manipulation ISAs using macro-op fusion

| x86 Ext | x86 Instruction          | Bytes (x86) | Bytes (RV) | RISC-V Code               |
|---------|--------------------------|-------------|------------|---------------------------|
| ABM     | popcnt                   | 5           | 4          | pcnt a0, a0               |
|         | lzcnt                    | 5           | 4          | clz a0, a0                |
| BMI1    | andn                     | 5           | 4          | andc a0, a2, a1           |
|         | bextr (regs)<sup>1</sup> | 5           | 12         | c.add a0, a1              |
|         |                          |             |            | slo a0, zero, a0          |
|         |                          |             |            | c.and a0, a2              |
|         |                          |             |            | srl a0, a0, a1            |
|         | blsi                     | 5           | 6          | neg a0, a1                |
|         |                          |             |            | c.and a0, a1              |
|         | blsmsk                   | 5           | 6          | addi a0, a1, -1           |
|         |                          |             |            | c.xor a0, a1              |
|         | blsr                     | 5           | 6          | addi a0, a1, -1           |
|         |                          |             |            | c.and a0, a1              |
| BMI2    | bzhi                     | 5           | 6          | slo a0, zero, a2          |
|         |                          |             |            | c.and a0, a1              |
|         | mulx<sup>2</sup>         | 5           | 4          | mul                       |
|         | pdep                     | 5           | 4          | bdep                      |
|         | pext                     | 5           | 4          | bext                      |
|         | rorx<sup>2</sup>         | 6           | 4          | rori                      |
|         | sarx<sup>2</sup>         | 5           | 4          | sra                       |
|         | shrx<sup>2</sup>         | 5           | 4          | srl                       |
|         | shlx<sup>2</sup>         | 5           | 4          | sll                       |
| TBM     | bextr (imm)              | 7           | 4          | c.slli a0, (32-START-LEN) |
|         |                          |             |            | c.srli a0, (32-LEN)       |
|         | blcfill                  | 5           | 6          | addi a0, a1, 1            |
|         |                          |             |            | c.and a0, a1              |
|         | blci                     | 5           | 8          | addi a0, a1, 1            |
|         |                          |             |            | c.not a0                  |
|         |                          |             |            | c.or a0, a1               |
|         | blcic                    | 5           | 8          | addi a0, a1, 1            |
|         |                          |             |            | andc a0, a0, a1           |
|         | blcmsk                   | 5           | 6          | addi a0, a1, 1            |
|         |                          |             |            | c.xor a0, a1              |
|         | blcs                     | 5           | 6          | addi a0, a1, 1            |
|         |                          |             |            | c.or a0, a1               |
|         | blsfill                  | 5           | 6          | addi a0, a1, -1           |
|         |                          |             |            | c.or a0, a1               |
|         | blsic                    | 5           | 10         | addi a0, a1, -1           |
|         |                          |             |            | andc a0, a1, a0           |
|         |                          |             |            | c.not a0                  |
|         | t1mskc                   | 5           | 10         | addi a0, a1, +1           |
|         |                          |             |            | andc a0, a1, a0           |
|         |                          |             |            | c.not a0                  |
|         | tzmsk                    | 5           | 8          | addi a0, a1, -1           |
|         |                          |             |            | andc a0, a0, a1           |

<sup>1</sup> The BMI1 bextr instruction expects the length and start position packed in one register operand. Our version expects the length in a0, start position in a1, and source value in a2.

<sup>2</sup> The \*x BMI2 instructions just perform the indicated operation without changing any flags. RISC-V does not use flags, so these instructions trivially map to their regular RISC-V counterparts.

There will be a separate RISC-V standard for recommended sequences for macro-op fusion. The macros listed here are merely for demonstrating that suitable sequences exist. We do not advocate for any of those sequences to become "standard sequences" for macro-op fusion.
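For reference, several of the two-instruction sequences in Table 4.1 correspond to well-known C idioms. The sketch below lists a few of them for 32-bit operands; the emu_ prefix is only there to avoid clashing with compiler intrinsics and is not a naming proposal.

```c
#include <assert.h>
#include <stdint.h>

uint32_t emu_blsi(uint32_t x)    { return x & (UINT32_C(0) - x); } /* neg; and: isolate lowest set bit     */
uint32_t emu_blsmsk(uint32_t x)  { return x ^ (x - 1u); }          /* addi -1; xor: mask up to lowest set  */
uint32_t emu_blsr(uint32_t x)    { return x & (x - 1u); }          /* addi -1; and: clear lowest set bit   */
uint32_t emu_blcfill(uint32_t x) { return x & (x + 1u); }          /* addi +1; and: clear trailing ones    */
uint32_t emu_blcs(uint32_t x)    { return x | (x + 1u); }          /* addi +1; or: set lowest clear bit    */
uint32_t emu_tzmsk(uint32_t x)   { return ~x & (x - 1u); }         /* addi -1; andc: mask of trailing zeros */
```

For x = 0x58 (binary 01011000), emu_blsi gives 0x08, emu_blsr gives 0x50 and emu_tzmsk gives 0x07.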

### 4.2 Emulating RI5CY Bit Manipulation ISA

**TBD** 

### 4.3 Decoding RISC-V Immediates

The following code snippets decode the immediate from RISC-V S-type, B-type, J-type, and CJ-type instructions. They are nice "nothing up my sleeve" examples of real-world bit permutations.

| Format  | Instruction bits | Immediate bits                      |
|---------|------------------|-------------------------------------|
| S-type  | 31:25, 11:7      | imm[11:5], imm[4:0]                 |
| B-type  | 31:25, 11:7      | imm[12\|10:5], imm[4:1\|11]         |
| J-type  | 31:12            | imm[20\|10:1\|11\|19:12]            |
| CJ-type | 12:2             | imm[11\|4\|9:8\|10\|6\|7\|3:1\|5]   |

```
decode_s:
  li t0, 0xfe000f80
  bext a0, a0, t0
  c.slli a0, 20
  c.srai a0, 20
  ret
decode_b:
  rori a0, a0, 8
  lui t0, 0x804eb
  shuffle a0, a0, t0
  li t0, 0x80fe0e01
  bext a0, a0, t0
  c.slli a0, 20
  c.srai a0, 19
  ret
// variant 1 (with shuffle/bext)
decode_j:
  lui t0, 0x0fffb
  shuffle a0, a0, t0
  lui t0, 0x0f40a
  shuffle a0, a0, t0
  lui t0, 0x70fec
  shuffle a0, a0, t0
  li t0, 0x8ff170fe
  bext a0, a0, t0
  c.slli a0, 12
  c.srai a0, 11
  ret
// variant 2 (with bext but without shuffle)
decode_j:
  li t0, 0x800ff000
  li a1, 0x00100000
  bext a2, a0, t0
  c.and a1, a0
  c.slli a0, 1
  c.srli a0, 22
  c.slli a2, 23
  c.slli a1, 2
  c.slli a0, 12
  c.or a0, a2
  c.or a0, a1
  c.srai a0, 11
  ret
// variant 1 (with shuffle/bext)
decode_cj:
  grevi a0, a0, 1
  lui t0, 0xebcac
  shuffle a0, a0, t0
  lui t0, 0xe3469
  shuffle a0, a0, t0
  li t0, 0x8bc20464
  bext a0, a0, t0
  c.slli a0, 21
  c.srai a0, 20
  ret
// variant 2 (without shuffle/bext)
decode_cj:
  srli a5, a0, 2
  srli a4, a0, 7
  c.andi a4, 16
  slli a3, a0, 3
  c.andi a5, 14
  c.add a5, a4
  andi a3, a3, 32
  srli a4, a0, 1
  c.add a5, a3
  andi a4, a4, 64
  slli a2, a0, 1
  c.add a5, a4
  andi a2, a2, 128
  srli a3, a0, 1
  slli a4, a0, 19
  c.add a5, a2
  andi a3, a3, 768
  c.slli a0, 2
  c.add a5, a3
  andi a0, a0, 1024
  c.srai a4, 31
  c.add a5, a0
  slli a0, a4, 11
  c.add a0, a5
  ret
```
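When evaluating routines like these, it helps to have a plain C reference to compare against. The following sketch decodes the S-type, B-type and J-type immediates field by field according to the standard RISC-V base encodings. The function names are hypothetical, and the CJ-type case is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of v (1 <= bits <= 31). */
int32_t sext(uint32_t v, unsigned bits)
{
    uint32_t sign = UINT32_C(1) << (bits - 1);
    int32_t r = (int32_t)(v & (sign - 1));
    return (v & sign) ? r - (int32_t)sign : r;
}

int32_t decode_s_ref(uint32_t insn)
{
    uint32_t imm = ((insn >> 25) << 5)               /* imm[11:5] */
                 | ((insn >> 7) & 0x1Fu);            /* imm[4:0]  */
    return sext(imm, 12);
}

int32_t decode_b_ref(uint32_t insn)
{
    uint32_t imm = ((insn >> 31) << 12)              /* imm[12]   */
                 | (((insn >> 7) & 1u) << 11)        /* imm[11]   */
                 | (((insn >> 25) & 0x3Fu) << 5)     /* imm[10:5] */
                 | (((insn >> 8) & 0xFu) << 1);      /* imm[4:1]  */
    return sext(imm, 13);
}

int32_t decode_j_ref(uint32_t insn)
{
    uint32_t imm = ((insn >> 31) << 20)              /* imm[20]    */
                 | (((insn >> 12) & 0xFFu) << 12)    /* imm[19:12] */
                 | (((insn >> 20) & 1u) << 11)       /* imm[11]    */
                 | (((insn >> 21) & 0x3FFu) << 1);   /* imm[10:1]  */
    return sext(imm, 21);
}
```

For instance, 0xffdff06f (a jal x0 to the preceding instruction) decodes to -4 with decode_j_ref, and 0x00112423 (sw x1, 8(x2)) decodes to 8 with decode_s_ref.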

## Chapter 5

## Change History

Table 5.1: Summary of Changes

| Date       | Rev  | Changes                                                    |
|------------|------|------------------------------------------------------------|
| 2017-07-17 | 0.10 | Initial Draft                                              |
| 2017-11-02 | 0.11 | Removed roli, assembler can convert it to use a rori       |
|            |      | Removed bitwise subset and replaced with andc              |
|            |      | Doc source text same base for study and spec.              |
|            |      | Fixed typos                                                |
| 2017-11-30 | 0.32 | Jump rev number to be on par with associated Study         |
|            |      | Moved pdep/pext into spec draft and called it scattergather |
| 2018-04-07 | 0.33 | Move to github, throw out study, convert from .md to .tex  |
|            |      | Fixed typos and fixed some reference C implementations     |
|            |      | Rename bgat/bsca to bext/bdep                              |
|            |      | Remove post-add immediate from clz                         |
|            |      | Clean up encoding tables and code sections                 |
| 2018-04-20 | 0.34 | Add GREV, CTZ, and compressed instructions                 |
|            |      | Restructure document: Move discussions to extra sections   |
|            |      | Add FAQ, add analysis of used encoding space               |
|            |      | Add Pseudo-Ops, Macros, Algorithms                         |
|            |      | Add Generalized Bit Permutations (shuffle)                 |