Skip to content

Example: Nexus Proof of Work Function

Julian Kemmerer edited this page Apr 25, 2021 · 47 revisions

This example discusses the compiler-autopipelining of a rather large cryptographic hash function.

Discord user Wolf9466 created a highly pipelined hand written Verilog implementation in addition to some test C code. It is the test C code which is synthesized as PipelineC and compared to the Verilog implementation synthesis results.

Some information on the hash function from Wolf9466: The cryptocurrency Nexus uses a SK1024 proof-of-work which consists of two Skein block processes and three Keccak block processes. It's run on the block header - Keccak-1024(Skein-1024(data))

Original C code test for the Verilog NXSTest.c

Modified C code for synthesis NXSTest_syn.c

C Code Changes for Synthesis

Pointers to fixed size arrays were replaced with fixed size arrays passed by value. This include the creation of structs to return when this was necessary. Run-time constant values as variables were replaced with macro unrolling, and inlining of some functions and some bulky bit manipulations ORing bits together were replaced by compiler built in bit manipulations functions. (need better constant propagation across function boundaries and graph partitioning/pattern matching to use as originally written - anyone want to help with graph optimizations?). memcpy and memset replaced with some arrays.h macros for copying array contents. Finally some function inner loops were made into isolated single iteration functions: for synthesis time/quality of results.

Main pipeline structure

#pragma MAIN_MHZ DoNXSTest 850.0 // Timing goal
// Two input port u64 arrays, and one u64 output
uint64_t DoNXSTest(uint64_t State[26], uint64_t Key[17])
{
  uint64_t Nonce = 0x00000001FCAFC045ULL;
  uint64_t P[16];
  
  // loads low 8 qwords
  //memcpy(P, State + 16, 64);
  uint32_t i;
  for(i = 0; i < 8; i+=1) P[i] = State[i+2];
  
  P[8] = State[24];
  P[9] = State[25];
  P[10] = Nonce;
  
  for(i = 11; i < 16; i+=1) P[i] = 0ULL;
  
  // T constants inside func
  SkeinRoundTest_t srt = SkeinRoundTest0(P, Key); 
  P = srt.State;
  Key = srt.Key;
  
  for(i = 0; i < 8; i+=1) Key[i] = P[i] ^ State[16 + i];
  
  Key[8] = P[8] ^ State[24];
  Key[9] = P[9] ^ State[25];
  Key[10] = P[10] ^ Nonce;
	
  for(i = 11; i < 16; i+=1) Key[i] = P[i];
  
  // Clear P, the state argument to the final block is zero.
  //memset(P, 0x00, 128);
  ARRAY_SET(P, 0x00, 16)
  Key[16] = SKEIN_KS_PARITY;
  for(i = 0; i < 16; i+=1) Key[16] ^= Key[i];

  // T constants inside func
  SkeinRoundTest_t srt = SkeinRoundTest1(P, Key);
  P = srt.State;
  Key = srt.Key;

  uint64_t KeccakState[25];
  // memset(KeccakState, 0x00, 200);
  ARRAY_SET(KeccakState, 0x00, 25)
  // Copying qwords 0 - 8 (inclusive, so 9 qwords).
  // Note this is technically an XOR operation, just
  // with zero in this instance.
  // memcpy(KeccakState, P, 72);
  ARRAY_COPY(KeccakState, P, 9) 
  
  keccackf_t k = keccakf(KeccakState);
  KeccakState = k.st;
  
  for(i = 0; i < 7; i+=1) KeccakState[i] ^= P[i + 9];

  KeccakState[7] ^= 0x0000000000000005ULL;
  KeccakState[8] ^= 0x8000000000000000ULL;
  
  k = keccakf(KeccakState);
  KeccakState = k.st;

  k = keccakf(KeccakState);
  KeccakState = k.st;
  
  return(KeccakState[6]);
}

Data flows from inputs to the output return value. This is a pure function and all function inputs and outputs are passed by value.

The PipelineC tool produces a (currently crude) visual aid 'pipeline map' for identifying the critical path delay through your code. Being text based it is too large to show here but it can be seen that the two SkeinRoundTest blocks are ~1/2 of the datapath total length. Followed by the three keccakf blocks which take up the later ~1/2 of the datapath.

Part for synthesis estimates: xcvu9p-flgb2104-2-i

The comb. logic path

  Source:                 Key_input_reg_reg[2][0]/C
  Destination:            return_output_output_reg_reg[45]/D
  Data Path Delay:        208.747ns  (logic 79.923ns (38.287%)  route 128.824ns (61.713%))
  Logic Levels:           904  (CARRY8=361 LUT1=1 LUT2=26 LUT3=41 LUT4=252 LUT5=152 LUT6=71)

It is this comb. logic that the PipelineC compiler begins to autopipeline.

The tool uses information about the synthesized comb. delays of various submodules/functions in the design to insert pipelining registers.

SkeinMix8 Path delay (ns): 1.181 = 846.74 MHz
keccakf_iter Path delay (ns): 1.043 = 958.77MHz
SkeinEvenRound Path delay (ns): 2.804 = 356.63 MHz
SkeinOddRound Path delay (ns): 2.823 = 354.233 MHz
keccakf Path delay (ns): 30.107 = 33.214 MHz
SkeinRoundTest0 Path delay (ns): 59.98 = 16.672 MHz
SkeinRoundTest1 Path delay (ns): 60.102 = 16.638 MHz
DoNXSTest Path delay (ns): 208.903 = 4.78 MHz

Below are the results from the fast coarse grain pipelining throughput sweep:

================== Beginning Throughput Sweep ================================
Function: DoNXSTest Target MHz: 850.0
Setting all instances to comb. logic to start...
Starting with blank sweep state...
...determining slicing information for each main function...
Starting coarse sweep...
Slicing up pipeline stages...
DoNXSTest : 176 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 418.93590280687056 ( 2.387 ns)
DoNXSTest : 220 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 460.617227084293 ( 2.171 ns)
DoNXSTest : 263 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 479.84644913627636 ( 2.084 ns)
DoNXSTest : 305 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 535.6186395286556 ( 1.867 ns)
DoNXSTest : 346 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 537.9236148466917 ( 1.859 ns)
DoNXSTest : 386 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 537.9236148466917 ( 1.859 ns)
DoNXSTest : 425 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 541.4185165132648 ( 1.847 ns)
DoNXSTest : 463 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 631.7119393556538 ( 1.583 ns)
DoNXSTest : 500 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 590.6674542232723 ( 1.693 ns)
DoNXSTest : 536 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 590.6674542232723 ( 1.693 ns)
DoNXSTest : 571 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 696.8641114982578 ( 1.435 ns) *
DoNXSTest : 605 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 696.8641114982578 ( 1.435 ns)
DoNXSTest : 638 clocks latency, sliced coarsely...
 Clock Goal (MHz): 850.0 , Current MHz: 696.8641114982578 ( 1.435 ns)

The tool is able to pipeline the original 208.903 ns path up to a 208.903/1.435 ns ~= 145x faster throughput, 696.86 MHz, 571 clocks deep pipeline.

Timing Report:

Slack (VIOLATED) :        -0.259ns  (required time - arrival time)
  Source:                 DoNXSTest/SkeinRoundTest0_NXSTest_syn_c_l446_c26_1a99.../C
  Destination:            DoNXSTest/SkeinRoundTest0_NXSTest_syn_c_l446_c26_1a99.../D
  Requirement:            1.176ns  (clk_850p0 rise@1.176ns - clk_850p0 rise@0.000ns)
  Data Path Delay:        1.280ns  (logic 0.319ns (24.922%)  route 0.961ns (75.078%))
  Logic Levels:           5  (LUT2=1 LUT4=1 LUT6=3)

Utilization:

+------+--------+-------+
|      |Cell    |Count  |
+------+--------+-------+
|1     |BUFG    |      1|
|2     |CARRY8  |  18112|
|3     |LUT1    |    146|
|4     |LUT2    | 253168|
|5     |LUT3    | 186722|
|6     |LUT4    |  58915|
|7     |LUT5    |  41453|
|8     |LUT6    |   9659|
|9     |SRL16E  |  42699|
|10    |SRLC32E |   5092|
|11    |FDRE    | 862090|
|12    |IBUF    |   2753|
|13    |OBUF    |     64|
+------+--------+-------+
Total LUTs: 550063

Verilog Hand Tuned Pipeline Synthesis Results

localparam TOTALSTAGES = (SKEINBLKSTAGES * 2) + (KECCAKBLKSTAGES * 3) + 2;

Wolf9466 designed a pipeline that splits the two Skein block processes and three Keccak block processes in total across 390 pipeline stages. Individual rounds of hash functions were identified as being small enough to fit into single clock cycles. From there an additional pipeline stage or two were added for intermediate operations that didn't fit evenly into other stages.

Part for synthesis estimates: xcvu33p-fsvh2104-2L-e

Timing Report:

An fmax of roughly ~846MHz = ~1.182ns period.

wtiming

Utilization

+----------------------------+--------+-------+-----------+-------+
|          Site Type         |  Used  | Fixed | Available | Util% |
+----------------------------+--------+-------+-----------+-------+
| CLB LUTs*                  | 333899 |     0 |    439680 | 75.94 |
|   LUT as Logic             | 331829 |     0 |    439680 | 75.47 |
|   LUT as Memory            |   2070 |     0 |    205440 |  1.01 |
|     LUT as Distributed RAM |      0 |     0 |           |       |
|     LUT as Shift Register  |   2070 |     0 |           |       |
| CLB Registers              | 529262 |     0 |    879360 | 60.19 |
|   Register as Flip Flop    | 529262 |     0 |    879360 | 60.19 |
|   Register as Latch        |      0 |     0 |    879360 |  0.00 |
| CARRY8                     |  15280 |     0 |     54960 | 27.80 |
| F7 Muxes                   |      0 |     0 |    219840 |  0.00 |
| F8 Muxes                   |      0 |     0 |    109920 |  0.00 |
| F9 Muxes                   |      0 |     0 |     54960 |  0.00 |
+----------------------------+--------+-------+-----------+-------+

Design Comparison PipelineC vs. Verilog

Two FPGAs of the same device family were used:

Verilog synthesized part: xcvu33p-fsvh2104-2L-e

PipelineC synthesized part: xcvu9p-flgb2104-2-i

Timing Paths:

Verilog: 390 pipeline stages PipelineC: 572 pipeline stages

Verilog: CARRY8=8 LUT2=1 path
     Likely a single u64 add. Reports more logic delay but less route delay compare to PipelineC path.
     FMAX ~= 846MHz
PipelineC: LUT2=1 LUT4=1 LUT6=3 path
     Reports less logic delay than the Verilog, but significantly more route delay.
     FMAX ~= 696MHz

Utilization:

            Verilog | PipelineC
Total LUTs:  333899 | 550063 (1.64x)
CARRY8:      15280  | 18112  (1.18x)
Registers:   529262 | 862090 (1.62x)

Verilog: ~1000 lines of code PipelineC: ~400 lines of code

Since the PipelineC tool pipelined so coarsely it overshot the perfectly balanced ~390 pipeline stages of the Verilog design, going up to 572 stages (well into a suboptimally pipelined situation). It is still unclear why the number of LUTs is significantly higher than expected.

PipelineC's coarse grain throughput sweep (attempt to push higher and higher fmax) is not well targeted to fine tuning on the order of <= 1 logic level - which is required for to match the precise pipeline balancing done in the Verilog design. (As opposed to how PipelineC is currently well targeted to lower latency, mid-range fmax designs).

Attaining the last few percent of theoretical device fmax depends on very precise stage delay balancing. It is still an open area of development to provide a coarse grain / 'initial guess' method of dividing pipeline stages that to takes into account ~<=1 logic level device specific effects.

If you are working on a PipelineC design at this scale/fmax, please contact me (Julian) as it would be a perfect driving motivation to help improve the tool further with some closer attention.

Clone this wiki locally