LLVM #14

suarezvictor · 2021-09-18T05:37:26Z

suarezvictor
Sep 18, 2021

I was able to compile a C function with LLVM (optimized output), then parse the assembly and convert to C in a compatible fashion with PipelineC (indeed generated code compiles). Example: a "fast" inverse square root function:

SOURCE:

float float_rsqrt( float number )
{
const float x2 = number * float_rsqrt_K0; //0.5f;
const float threehalfs = float_rsqrt_K1; //1.5f;
union {
float f;
uint32_t i;
} conv = { .f = number };
conv.i = float_rsqrt_K2 - ( conv.i >> 1 ); //0x5f3759df = 1597463007
conv.f *= threehalfs - ( x2 * conv.f * conv.f );
return conv.f;
}
AFTER PARSING LLVM IR OUTPUT AND GENERATING C

float llvm_dis_Z11float_rsqrtf( float a0)
{
float a2 = LOAD(llvm_dis_float_rsqrt_K0); // %2 = load float, float* @float_rsqrt_K0, align 4, !tbaa !3
float a3 = a2 * a0; // %3 = fmul float %2, %0
float a4 = LOAD(llvm_dis_float_rsqrt_K1); // %4 = load float, float* @float_rsqrt_K1, align 4, !tbaa !3
uint32_t a5 = BITCAST_I32(a0); // %5 = bitcast float %0 to i32
uint32_t a6 = LOAD(llvm_dis_float_rsqrt_K2); // %6 = load i32, i32* @float_rsqrt_K2, align 4, !tbaa !7
uint32_t a7 = a5 >> 1; // %7 = lshr i32 %5, 1
uint32_t a8 = a6 - a7; // %8 = sub i32 %6, %7
float a9 = BITCAST_FLOAT(a8); // %9 = bitcast i32 %8 to float
float a10 = a3 * a9; // %10 = fmul float %3, %9
float a11 = a10 * a9; // %11 = fmul float %10, %9
float a12 = a4 - a11; // %12 = fsub float %4, %11
float a13 = a12 * a9; // %13 = fmul float %12, %9
return a13; // ret float %13
}

No luck yet simulating it with CXXRTL/GHDL or even synthetizing with Quartus...
It seems that PipelineC isn't smart enough to use less than the 14 variables (not all intermediate values reach to the end of the pipeline) but that may be solved with another optimization path with LLVM or done manually ar the code generation phase.

JulianKemmerer · 2021-09-20T19:52:41Z

JulianKemmerer
Sep 20, 2021
Maintainer

Oh @suarezvictor I didnt really get to look at this in detail until now.

Not sure what running through LLVM was about? For optimizations eh? Would have to define/replace whatever built in LOAD and casting functions are doing...

if you are wanting to write custom floating point functions (ex. rsqrt) that is defintely do-able. In fact the FP operators themselves (multiply, add, etc) are implemented as PipelineC code you could base future work off of

float BIN_OP_MULT_float_float_float(float left, float right)
{
  // Get mantissa exponent and sign for both
  // LEFT
  uint23_t x_mantissa;
  x_mantissa = float_22_0(left);
  uint9_t x_exponent_wide;
  x_exponent_wide = float_30_23(left);
  uint1_t x_sign;
  x_sign = float_31_31(left);
  // RIGHT
  uint23_t y_mantissa;
  y_mantissa = float_22_0(right);
  uint9_t y_exponent_wide;
  y_exponent_wide = float_30_23(right);
  uint1_t y_sign;
  y_sign = float_31_31(right);
  
  // Declare the output portions
  uint23_t z_mantissa;
  uint8_t z_exponent;
  uint1_t z_sign;
  
  // Sign
  z_sign = x_sign ^ y_sign;
  
  // Multiplication with infinity = inf
  if((x_exponent_wide==255) | (y_exponent_wide==255))
  {
    z_exponent = 255;
    z_mantissa = 0;
  }
  // Multiplication with zero = zero
  else if((x_exponent_wide==0) | (y_exponent_wide==0))
  {
    z_exponent = 0;
    z_mantissa = 0;
    z_sign = 0;
  }
  // Normal non zero|inf mult
  else
  {
    // Delcare intermediates
    uint1_t aux;
    uint24_t aux2_x;
    uint24_t aux2_y;
    uint48_t aux2;
    uint7_t BIAS;
    BIAS = 127;
    
    aux2_x = uint1_uint23(1, x_mantissa);
    aux2_y = uint1_uint23(1, y_mantissa);
    aux2 =  aux2_x * aux2_y;
    // args in Q23 result in Q46
    aux = uint48_47_47(aux2);
    if(aux) //if(aux == 1)
    { 
      // >=2, shift left and add one to exponent
      // HACKY NO ROUNDING + aux2(23); // with round18
      z_mantissa = uint48_46_24(aux2); 
    }
    else
    { 
      // HACKY NO ROUNDING + aux2(22); // with rounding
      z_mantissa = uint48_45_23(aux2); 
    }
    
    // calculate exponent in parts 
    // do sequential unsigned adds and subs to avoid signed numbers for now
    // X and Y exponent are already 1 bit wider than needed
    // (0 & x_exponent) + (0 & y_exponent);
    uint9_t exponent_sum = x_exponent_wide + y_exponent_wide;
    exponent_sum = exponent_sum + aux;
    exponent_sum = exponent_sum - BIAS;
    
    // HACKY NOT CHECKING
    // if (exponent_sum(8)='1') then
    z_exponent = uint9_7_0(exponent_sum);
  }
  
  
  // Assemble output
  return float_uint1_uint8_uint23(z_sign, z_exponent, z_mantissa);
}

Currently that code is internally generated and used. Youll see that it doesnt use unions and bitfields which are a complicated when, as in hardware, you have no traditional 'layout of bytes in memory' folks might expect. Instead it uses some built in bit manipulation functions https://github.com/JulianKemmerer/PipelineC/wiki/Automatically-Generated-Functionality#bitmanip

Let me know if you want any help, getting Quartus running to synthesize some stuff etc

0 replies

suarezvictor · 2021-09-21T13:39:36Z

suarezvictor
Sep 21, 2021
Author

I did the LLVM example to get an optimized version of a defined C algorithm, for example 1/sqrt as shown. My plan is to run in pipeline C with CXXRTL (and count number of cycles) and un that CXXRTL main.c file also run the original algorithm and compare results are equal. You may make that automatically when generating CXXRTL code. Another advantage of using such optimized CXXRTL code is that it imposses less demand on sintax to pipelineC (i.e. no need for unions). The LOAD function just correaponds to a memory access and implementation both in C and pipeline C is to return the argument (identity function) or as a simple macro # define LOAD(a) (a) It was left there for clarity Once such proposed system of comparing the results un simulator and real C, a complete math library can be built. Want operations with complex numbers? Overloaded * operators? Quaternions? Implement it in C with templates or whatever you like: LLVM will convert it to simple binary or unary operations on flotas easy to process with pipelineC. A tool I'm doing converts LLVM IR assembly output to simple (pipelineC compatible) C. I can continue improving that tool (next step is reuse of variables to lower its count) and you can help me with al quirks with pipelineC including passing it to quartus. El lun., 20 sep. 2021 16:52, Julian Kemmerer ***@***.***> escribió:

…

Oh @suarezvictor <https://github.com/suarezvictor> I didnt really get to look at this in detail until now. Not sure what running through LLVM was about? For optimizations eh? Would have to define/replace whatever built in LOAD and casting functions are doing... if you are wanting to write custom floating point functions (ex. rsqrt) that is defintely do-able. In fact the FP operators themselves (multiply, add, etc) implemented as PipelineC code you could base future work off of float BIN_OP_MULT_float_float_float(float left, float right) { // Get mantissa exponent and sign for both // LEFT uint23_t x_mantissa; x_mantissa = float_22_0(left); uint9_t x_exponent_wide; x_exponent_wide = float_30_23(left); uint1_t x_sign; x_sign = float_31_31(left); // RIGHT uint23_t y_mantissa; y_mantissa = float_22_0(right); uint9_t y_exponent_wide; y_exponent_wide = float_30_23(right); uint1_t y_sign; y_sign = float_31_31(right); // Declare the output portions uint23_t z_mantissa; uint8_t z_exponent; uint1_t z_sign; // Sign z_sign = x_sign ^ y_sign; // Multiplication with infinity = inf if((x_exponent_wide==255) | (y_exponent_wide==255)) { z_exponent = 255; z_mantissa = 0; } // Multiplication with zero = zero else if((x_exponent_wide==0) | (y_exponent_wide==0)) { z_exponent = 0; z_mantissa = 0; z_sign = 0; } // Normal non zero|inf mult else { // Delcare intermediates uint1_t aux; uint24_t aux2_x; uint24_t aux2_y; uint48_t aux2; uint7_t BIAS; BIAS = 127; aux2_x = uint1_uint23(1, x_mantissa); aux2_y = uint1_uint23(1, y_mantissa); aux2 = aux2_x * aux2_y; // args in Q23 result in Q46 aux = uint48_47_47(aux2); if(aux) //if(aux == 1) { // >=2, shift left and add one to exponent // HACKY NO ROUNDING + aux2(23); // with round18 z_mantissa = uint48_46_24(aux2); } else { // HACKY NO ROUNDING + aux2(22); // with rounding z_mantissa = uint48_45_23(aux2); } // calculate exponent in parts // do sequential unsigned adds and subs to avoid signed numbers for now // X and Y exponent are already 1 bit wider than needed // (0 & x_exponent) + (0 & y_exponent); uint9_t exponent_sum = x_exponent_wide + y_exponent_wide; exponent_sum = exponent_sum + aux; exponent_sum = exponent_sum - BIAS; // HACKY NOT CHECKING // if (exponent_sum(8)='1') then z_exponent = uint9_7_0(exponent_sum); } // Assemble output return float_uint1_uint8_uint23(z_sign, z_exponent, z_mantissa); } Currently that code is internally generated and used. Youll see that it doesnt use unions and bitfields which are a complicated when, as in hardware, you have no traditional 'layout of bytes in memory' folks might expect. Instead it uses some built in bit manipulation functions https://github.com/JulianKemmerer/PipelineC/wiki/Automatically-Generated-Functionality#bitmanip Let me know if you want any help, getting Quartus running to synthesize some stuff etc — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACBHVWLFZGL5WUA6ANSJAZTUC6GJLANCNFSM5EI2MBHQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

1 reply

JulianKemmerer Sep 23, 2021
Maintainer

I really love the idea of getting to use the output of LLVM
So many times it would be nice to deal with some template types, etc like you said - i.e. modern language luxuries
Excited to keep seeing output from this

suarezvictor · 2021-09-21T13:44:56Z

suarezvictor
Sep 21, 2021
Author

Forgot to say that to build such math library I can reuse code from existing C and C++ libraries without any touch and that would include their already written tests. El mar., 21 sep. 2021 10:39, Victor Suarez Rovere ***@***.***> escribió:

…

I did the LLVM example to get an optimized version of a defined C algorithm, for example 1/sqrt as shown. My plan is to run in pipeline C with CXXRTL (and count number of cycles) and un that CXXRTL main.c file also run the original algorithm and compare results are equal. You may make that automatically when generating CXXRTL code. Another advantage of using such optimized CXXRTL code is that it imposses less demand on sintax to pipelineC (i.e. no need for unions). The LOAD function just correaponds to a memory access and implementation both in C and pipeline C is to return the argument (identity function) or as a simple macro # define LOAD(a) (a) It was left there for clarity Once such proposed system of comparing the results un simulator and real C, a complete math library can be built. Want operations with complex numbers? Overloaded * operators? Quaternions? Implement it in C with templates or whatever you like: LLVM will convert it to simple binary or unary operations on flotas easy to process with pipelineC. A tool I'm doing converts LLVM IR assembly output to simple (pipelineC compatible) C. I can continue improving that tool (next step is reuse of variables to lower its count) and you can help me with al quirks with pipelineC including passing it to quartus. El lun., 20 sep. 2021 16:52, Julian Kemmerer ***@***.***> escribió: > Oh @suarezvictor <https://github.com/suarezvictor> I didnt really get to > look at this in detail until now. > > Not sure what running through LLVM was about? For optimizations eh? Would > have to define/replace whatever built in LOAD and casting functions are > doing... > > if you are wanting to write custom floating point functions (ex. rsqrt) > that is defintely do-able. In fact the FP operators themselves (multiply, > add, etc) implemented as PipelineC code you could base future work off of > > float BIN_OP_MULT_float_float_float(float left, float right) > { > // Get mantissa exponent and sign for both > // LEFT > uint23_t x_mantissa; > x_mantissa = float_22_0(left); > uint9_t x_exponent_wide; > x_exponent_wide = float_30_23(left); > uint1_t x_sign; > x_sign = float_31_31(left); > // RIGHT > uint23_t y_mantissa; > y_mantissa = float_22_0(right); > uint9_t y_exponent_wide; > y_exponent_wide = float_30_23(right); > uint1_t y_sign; > y_sign = float_31_31(right); > > // Declare the output portions > uint23_t z_mantissa; > uint8_t z_exponent; > uint1_t z_sign; > > // Sign > z_sign = x_sign ^ y_sign; > > // Multiplication with infinity = inf > if((x_exponent_wide==255) | (y_exponent_wide==255)) > { > z_exponent = 255; > z_mantissa = 0; > } > // Multiplication with zero = zero > else if((x_exponent_wide==0) | (y_exponent_wide==0)) > { > z_exponent = 0; > z_mantissa = 0; > z_sign = 0; > } > // Normal non zero|inf mult > else > { > // Delcare intermediates > uint1_t aux; > uint24_t aux2_x; > uint24_t aux2_y; > uint48_t aux2; > uint7_t BIAS; > BIAS = 127; > > aux2_x = uint1_uint23(1, x_mantissa); > aux2_y = uint1_uint23(1, y_mantissa); > aux2 = aux2_x * aux2_y; > // args in Q23 result in Q46 > aux = uint48_47_47(aux2); > if(aux) //if(aux == 1) > { > // >=2, shift left and add one to exponent > // HACKY NO ROUNDING + aux2(23); // with round18 > z_mantissa = uint48_46_24(aux2); > } > else > { > // HACKY NO ROUNDING + aux2(22); // with rounding > z_mantissa = uint48_45_23(aux2); > } > > // calculate exponent in parts > // do sequential unsigned adds and subs to avoid signed numbers for now > // X and Y exponent are already 1 bit wider than needed > // (0 & x_exponent) + (0 & y_exponent); > uint9_t exponent_sum = x_exponent_wide + y_exponent_wide; > exponent_sum = exponent_sum + aux; > exponent_sum = exponent_sum - BIAS; > > // HACKY NOT CHECKING > // if (exponent_sum(8)='1') then > z_exponent = uint9_7_0(exponent_sum); > } > > > // Assemble output > return float_uint1_uint8_uint23(z_sign, z_exponent, z_mantissa); > } > > Currently that code is internally generated and used. Youll see that it > doesnt use unions and bitfields which are a complicated when, as in > hardware, you have no traditional 'layout of bytes in memory' folks might > expect. Instead it uses some built in bit manipulation functions > https://github.com/JulianKemmerer/PipelineC/wiki/Automatically-Generated-Functionality#bitmanip > > Let me know if you want any help, getting Quartus running to synthesize > some stuff etc > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#14 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ACBHVWLFZGL5WUA6ANSJAZTUC6GJLANCNFSM5EI2MBHQ> > . > Triage notifications on the go with GitHub Mobile for iOS > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> > or Android > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. > >

0 replies

suarezvictor · 2021-09-23T05:29:02Z

suarezvictor
Sep 23, 2021
Author

Register reuse optimization
reduces the number or register (local variables) used. For example from 12 locals to 7

`
float llvm_dis_Z11float_rsqrtf( FLOAT a0)
{
FLOAT a2;
FLOAT a3;
FLOAT a4;
uint32_t a5;
uint32_t a6;
uint32_t a7;
uint32_t a8;
FLOAT a9;
FLOAT a10;
FLOAT a11;
FLOAT a12;
FLOAT a13;
a2 = LOAD(llvm_dis_float_rsqrt_K0); // %2 = load float, float* @float_rsqrt_K0, align 4, !tbaa !3
a3 = a2 * a0; // %3 = fmul float %2, %0
a4 = LOAD(llvm_dis_float_rsqrt_K1); // %4 = load float, float* @float_rsqrt_K1, align 4, !tbaa !3
a5 = BITCAST_I32(a0); // %5 = bitcast float %0 to i32
a6 = LOAD(llvm_dis_float_rsqrt_K2); // %6 = load i32, i32* @float_rsqrt_K2, align 4, !tbaa !7
a7 = a5 >> 1; // %7 = lshr i32 %5, 1
a8 = a6 - a7; // %8 = sub i32 %6, %7
a9 = BITCAST_FLOAT(a8); // %9 = bitcast i32 %8 to float
a10 = a3 * a9; // %10 = fmul float %3, %9
a11 = a10 * a9; // %11 = fmul float %10, %9
a12 = a4 - a11; // %12 = fsub float %4, %11
a13 = a12 * a9; // %13 = fmul float %12, %9
return a13; // ret float %13
}

After:
float llvm_dis_Z11float_rsqrtf( FLOAT a0)
{
FLOAT a3;
FLOAT a2;
FLOAT a9;
uint32_t a5;
FLOAT a10;
uint32_t a6;
uint32_t a7;
//local variables count optimization: from 12 to 7
a2 = LOAD(llvm_dis_float_rsqrt_K0); // %2 = load float, float* @float_rsqrt_K0, align 4, !tbaa !3
a3 = a2 * a0; // %3 = fmul float %2, %0
a2 = LOAD(llvm_dis_float_rsqrt_K1); // %4 = load float, float* @float_rsqrt_K1, align 4, !tbaa !3
a5 = BITCAST_I32(a0); // %5 = bitcast float %0 to i32
a6 = LOAD(llvm_dis_float_rsqrt_K2); // %6 = load i32, i32* @float_rsqrt_K2, align 4, !tbaa !7
a7 = a5 >> 1; // %7 = lshr i32 %5, 1
a5 = a6 - a7; // %8 = sub i32 %6, %7
a9 = BITCAST_FLOAT(a5); // %9 = bitcast i32 %8 to float
a10 = a3 * a9; // %10 = fmul float %3, %9
a3 = a10 * a9; // %11 = fmul float %10, %9
a10 = a2 - a3; // %12 = fsub float %4, %11
a3 = a10 * a9; // %13 = fmul float %12, %9
return a3; // ret float %13
}

`

2 replies

JulianKemmerer Sep 23, 2021
Maintainer

@suarezvictor important to note.
For pure (no static locals) PipelineC functions (i.e. to be autopipelined ~HLS like)
The number of local variables is not necessarily tied to the number of registers.
It all depends on where the tool inserts pipelining registers (otherwise only registers are inserted for input/args return/output).
I.e. Its possible for the function to use no registers or lots.

You can think of local variables as wires for now and I wouldnt spend too much time optimizing there just yet (worry about how many primitive math operations on values you do)

(Now if you get into FSM/sequential pipelinec funcitons then yes local variables equate to regs but thats more experimental)

JulianKemmerer Sep 23, 2021
Maintainer

In fact - because of the way PipelineC extracts parallellism (dependency graph built from usage) reusing the same local variables makes longer sequential pipelines - which might be what you want - but often folks prefer 'as parallel as possible' which would in C code equate to a new local for each assignment - dont need to go that far, . Just trying to get at the 'how many local variables to use' problem and how it changes the output hardware

suarezvictor · 2021-09-23T16:42:41Z

suarezvictor
Sep 23, 2021
Author

Nice to know you are already doing interesting optimizations. Would you elaborate how to estimate how many registers are used? As far as I understood you need to make a matrix with number of columns corresponding to amount maximum register usage at a determinate time, and rows corresponding to the latency.
I'd also appreciate to know how pipeline depth and frequency affects area usage (ie. like the latency vs. frequency graphs)

0 replies

suarezvictor · 2021-09-23T16:43:23Z

suarezvictor
Sep 23, 2021
Author

(I mean by latency = pipeline depth)

0 replies

suarezvictor · 2021-09-23T16:48:46Z

suarezvictor
Sep 23, 2021
Author

in regards to register reusage making "longer pipelines", I think most times latency is not very relevant, we want "results per second". As far as I understood, making pipeline depth larger doesn't affect throughput, and avoiding parallelism saves resource usage. I am understanding? how to control such tradeoff with PipelineC? In the project I'm thinking I prefer to save resource usage, maybe reusing registers is the way.
Another question is: do you report somewhere latency, amount of register usage and the like, so we are able to optimize designs?

0 replies

suarezvictor · 2021-09-23T16:50:40Z

suarezvictor
Sep 23, 2021
Author

for example I'd like to pass both codes (with register usage and without) and see how resource usage and latency is affected

0 replies

JulianKemmerer · 2021-09-23T19:02:58Z

JulianKemmerer
Sep 23, 2021
Maintainer

@suarezvictor what FPGA are you targeting? Can you give me your full part string?
Examples (you could pick one if you dont have a part in mind yet)

// Full part string examples:
// xc7a35ticsg324-1l     Artix 7 (Arty)
// xcvu9p-flgb2104-2-i   Virtex Ultrascale Plus (AWS F1)
// xcvu33p-fsvh2104-2-e  Virtex Ultrascale Plus
// xc7v2000tfhg1761-2    Virtex 7
// EP2AGX45CU17I3        Arria II GX
// 10CL120ZF780I8G       Cyclone 10 LP
// 5CGXFC9E7F35C8        Cyclone 5 GX
// 10M50SCE144I7G        Max 10
// LFE5U-85F-6BG381C     ECP5U
// LFE5UM5G-85F-8BG756C  ECP5UM5G 
// ICE40UP5K-SG48        ICE40UP
// T8F81                 Trion T8 (Xyloni)
// Ti60F225              Titanium

1 reply

suarezvictor Sep 23, 2021
Author

Part is EP4CE22F17C6
It's a Cyclone IV on the Terasic DE0-Nano board

JulianKemmerer · 2021-09-24T06:04:07Z

JulianKemmerer
Sep 24, 2021
Maintainer

@suarezvictor I plan on reporting on more details but after fixing some Quartus things you could try it yourself too
./src/pipelinec ./examples/llvm/rsqrtf.c --coarse

After seeing how the --coarse sweep goes you might be interested in supplying the --start and/or --sweep args (to limit/expand scope of latency sweep). And if you have an actual FMAX in mind you can try setting MAIN_MHZ pragma in the c code and running without --coarse. Though if the design is small you might get better results with --coarse. Lots of experimenting/tuning to try.

More info to come of pipeline hardware details

0 replies

JulianKemmerer · 2021-09-24T19:12:10Z

JulianKemmerer
Sep 24, 2021
Maintainer

@suarezvictor Do you have a Quartus paid license? (maybe dev board includes one?) I.e. Not the 'Lite' version of quartus?
I think to get pipelining using DSPs have two or three options

1) Use paid Quartus
2) Manually instantiate DSP prims (prim library C funcs) 
3) Help me figure out how to make the compiler convert big multiplies into smaller ones that map to hardware prims

Alternatively you can always try getting results not using DSPs - FPGA fabric only.

What do you think about this situation?

0 replies

JulianKemmerer · 2021-09-24T19:47:00Z

JulianKemmerer
Sep 24, 2021
Maintainer

Btw you can now switch between the styles of multiplier used globally with --mult infer or --mult fabric now (needed for internally generated float mult C code at the moment)
b4a88d0
(and still can control per func def in user code with pragmas - see pragma docs https://github.com/JulianKemmerer/PipelineC/wiki/Pragmas)

0 replies

JulianKemmerer · 2021-09-25T05:04:17Z

JulianKemmerer
Sep 25, 2021
Maintainer

@suarezvictor I have essentially done #2/3 for you in some stuff inside https://github.com/JulianKemmerer/PipelineC/blob/master/primitives/cyclone_iv.c

./src/pipelinec ./examples/llvm/rsqrtf.c --coarse should be a pretty representation of of the types of pipeline fmaxs and area to expect (though still some optimizations for fewer dsps - Karatusba algorithm, 1/4 fewer dsps could be done it seems)

We can talk long term about getting this 'built into the compiler'

0 replies

suarezvictor · 2021-09-25T16:50:00Z

suarezvictor
Sep 25, 2021
Author

Tour advances are great. I'm trying to allocate a bit of time to test it all. Some ideas for improvements are the following: -Use 9x9 multipliers that maybe use less area, takes less time and permit more registers insertion -Not use a 24x24 to 48 multiplier function since the mantisa is a fixed point representation and we dont'n need the 24 least significante bits (so no need to calculate them) Let's say we need to multiply two mantisa values A and B, 24 bit each: Value a can be represented as 1.abc with a b and c being 8 bit valies. B is 1.xyz, same width. Then you can mult: A*B=(a*x+a*y+a*z+b*x+b*y+c*x) that is not calculating least significant term b*z+c*y+c*z. Using the full 9 bits instead of 8 will solve the carry issues from the least significant term. Hopefully you grasp the idea if not I can do an implementation. I beg your patient until I finish some paid work and can get to this. El sáb., 25 sep. 2021 02:04, Julian Kemmerer ***@***.***> escribió:

…

@suarezvictor <https://github.com/suarezvictor> I have essentially done #2 <#2>/3 for you in some stuff inside https://github.com/JulianKemmerer/PipelineC/blob/master/primitives/cyclone_iv.c ./src/pipelinec ./examples/llvm/rsqrtf.c --coarse should be a pretty representation of of the types of pipeline fmaxs and area to expect (though still some optimizations for fewer dsps - Karatusba algorithm, 1/4 fewer dsps could be done it seems) We can talk long term about getting this 'built into the compiler' — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACBHVWJZQ64C42BVFEHLFYDUDVJ5XANCNFSM5EI2MBHQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

2 replies

JulianKemmerer Sep 25, 2021
Maintainer

I think I can implement that - I will add it to the experimental cycline_iv primitives file and see what I get

JulianKemmerer Sep 25, 2021
Maintainer

Thanks for your help so far - happy to be patience :)

suarezvictor · 2021-09-25T16:55:20Z

suarezvictor
Sep 25, 2021
Author

Oh, and I don't plan to use the paid versions of the tools, indeed my plan is to try to avandon such closed source tools as soon as posible even if I need to use chips from other vendors like Lattice El sáb., 25 sep. 2021 13:49, Victor Suarez Rovere ***@***.***> escribió:

…

Tour advances are great. I'm trying to allocate a bit of time to test it all. Some ideas for improvements are the following: -Use 9x9 multipliers that maybe use less area, takes less time and permit more registers insertion -Not use a 24x24 to 48 multiplier function since the mantisa is a fixed point representation and we dont'n need the 24 least significante bits (so no need to calculate them) Let's say we need to multiply two mantisa values A and B, 24 bit each: Value a can be represented as 1.abc with a b and c being 8 bit valies. B is 1.xyz, same width. Then you can mult: A*B=(a*x+a*y+a*z+b*x+b*y+c*x) that is not calculating least significant term b*z+c*y+c*z. Using the full 9 bits instead of 8 will solve the carry issues from the least significant term. Hopefully you grasp the idea if not I can do an implementation. I beg your patient until I finish some paid work and can get to this. El sáb., 25 sep. 2021 02:04, Julian Kemmerer ***@***.***> escribió: > @suarezvictor <https://github.com/suarezvictor> I have essentially done > #2 <#2>/3 for you in > some stuff inside > https://github.com/JulianKemmerer/PipelineC/blob/master/primitives/cyclone_iv.c > > ./src/pipelinec ./examples/llvm/rsqrtf.c --coarse should be a pretty > representation of of the types of pipeline fmaxs and area to expect (though > still some optimizations for fewer dsps - Karatusba algorithm, 1/4 fewer > dsps could be done it seems) > > We can talk long term about getting this 'built into the compiler' > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#14 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ACBHVWJZQ64C42BVFEHLFYDUDVJ5XANCNFSM5EI2MBHQ> > . > Triage notifications on the go with GitHub Mobile for iOS > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> > or Android > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. > >

0 replies

suarezvictor · 2021-09-25T22:37:23Z

suarezvictor
Sep 25, 2021
Author

`
Since a 3-digit value in base X can be decomposed in XXa+Xb+c

r = (XXa + Xb + c) * (XXd + Xe + f) = XXXXad + XXX(ae+bd) + XX*(af+be+cd) + X*(bf+ce) + cf

assuming 9-bit numbers (X = 1<<9), and ignoring least significant terms bf, ce, cf (i.e truncating), and dividing by XX (common shift of all numbers), then

r' = r/XX = XXad + X(ae+bd) + (af+be+cd)
r' = ad<<18 + (ae+bd)<<9 + (af+be+cd)

note that 9 least significant bits of cd, be, af can be ignored (doesn't contribute to the truncated value)

so 27x27 bit multiplication with 27 bit results become:
uint9_t a, b, c, d, e, f; //extract 9 bit fields

uint9_t cd = MSB_9(c * d);
uint10_t be_cd = cd + MSB_9(b * e);
uint11_t be_cd_af = be_cd + MSB_9(a * f);

uint18_t bd = b * d;
uint19_t ae_bd = bd + a * e;

uint18_t ad = a * d;
uint27_t r = (ad << 9) + (ae_bd + be_cd_af);
`

this may work

0 replies

suarezvictor · 2021-09-25T22:51:07Z

suarezvictor
Sep 25, 2021
Author

In case of Cyclone IV devices, fastest grade (C6 models) a 9×9-bit multiplier should reach 340 MHz and a 18×18-bit multiplier, 287 MHz

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/cyclone-iv/cyiv-53001.pdf

1 reply

JulianKemmerer Sep 26, 2021
Maintainer

@suarezvictor Do you have a cyclone iv rsqrtf to compare to? fmax+resources? (Does it get near those max fmax values you list?)

suarezvictor · 2021-09-25T23:13:25Z

suarezvictor
Sep 25, 2021
Author

Idea: considering that 3 bits would be discarded from the 27, first multipliers needs to be of just 3x3 bits and 3 bit results (i.e. 6 bits LUT should be enough). That may be better implemented in fabric thus half of multipliers are saved

0 replies

suarezvictor · 2021-09-26T00:09:42Z

suarezvictor
Sep 26, 2021
Author

Below are logic equations to form a 3x3 bit fixed point multiplier using just logic.
(that is, MSB_3(x*y) with uint3_t x,y)


uint1_t a,b,c; //first multiplicant: a=bit2, b=bit1, c=bit0
uint1_t d,e,f; //second multiplicant: d=bit2, e=bit1, f=bit0

uint_1 r2 = a&(e|!c)&(f|!b)&(b|d|e)&(c|d|f)&(d|e|f)&(!b|!d)&(!b|!e)&(!c|!d)&(!c|!f)&(!d|!e)&(!d|!f)&(b|e|!f)&(c|f|!e);
uint_1 r1 = b&(f|!a)&(!a|!c)&(!a|!d)&(!e|!f)&(c|!d|!e)&(d|!c|!e)&(a|c|d|!f)&(!c|!d|!f);
uint_1 r0 = c&(e|!a)&(!a|!d)&(d|!b|!e)&(d|!e|!f)&(e|!d|!f)&(!b|!e|!f);
 
//r2,r1,r0 => 3 bit MSB results of 3x3 multiplication

Calculated with:
https://docs.sympy.org/latest/modules/logic.html

l2 = [x*y for x in range(8) for y in range(8) if ((x*y)&int(32))>0]
l1 = [x*y for x in range(8) for y in range(8) if ((x*y)&int(16))>0]
l0 = [x*y for x in range(8) for y in range(8) if ((x*y)&int(8))>0]
a, b, c, d, e, f = symbols('a b c d e f')

str(POSform([a,b,c,d,e,f], l2))
str(POSform([a,b,c,d,e,f], l1))
str(POSform([a,b,c,d,e,f], l0))

0 replies

suarezvictor · 2021-09-26T01:35:20Z

suarezvictor
Sep 26, 2021
Author

Sorry, I couldn't test it with actual hardware just as cause of lack of time. I'm eaguer to do all tests regarding our discussions but I'm with pending work yet. El sáb., 25 sep. 2021 21:55, Julian Kemmerer ***@***.***> escribió:

…

@suarezvictor <https://github.com/suarezvictor> Do you have a cyclone iv rsqrtf to compare to? fmax+resources? (Does it get near those max fmax values you list?) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACBHVWI6X6GUL3MX4MLVOELUDZVRHANCNFSM5EI2MBHQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

0 replies

suarezvictor · 2021-09-26T01:38:00Z

suarezvictor
Sep 26, 2021
Author

I just posted fmax of the multipliers as listed on the datasheet so if implementation reach those values, there's nothing to optimize El sáb., 25 sep. 2021 21:55, Julian Kemmerer ***@***.***> escribió:

…

@suarezvictor <https://github.com/suarezvictor> Do you have a cyclone iv rsqrtf to compare to? fmax+resources? (Does it get near those max fmax values you list?) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACBHVWI6X6GUL3MX4MLVOELUDZVRHANCNFSM5EI2MBHQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

1 reply

JulianKemmerer Sep 26, 2021
Maintainer

I understand now. I am pretty sure the tool can reach that fmax - just at what resource cost - we'll see!

JulianKemmerer · 2021-09-26T03:12:04Z

JulianKemmerer
Sep 26, 2021
Maintainer

@suarezvictor I am running the tool with an interpretation of your code
./src/pipelinec ./examples/llvm/rsqrtf.c --coarse

Can you comment on my implementation?

PipelineC/primitives/cyclone_iv.c

Line 61 in 40ad626

uint27_t mult27x27(uint27_t x, uint27_t y)

And do you know the proper way to use this 27b multiplier in the context of the 24b floating point mantissa?
See old commented out code for 24x24b=48b mult

PipelineC/primitives/cyclone_iv.c

Line 140 in 40ad626

/*

0 replies

suarezvictor · 2021-09-26T14:11:59Z

suarezvictor
Sep 26, 2021
Author

To use 27 bit multiplier you should use same technique as with 48 bit code: only use the 24 most significant bits of results. And obviously when passing 27 bit arguments, do a 3 bit shift left to such 24 bit arguments. uint24_t aux2_x = uint1_uint23(1, x_mantissa); uint24_t aux2_y = uint1_uint23(1, y_mantissa); #if 0 // previous 48 bit code uint48_t aux2 = LPM_MULT24X24(aux2_x, aux2_y); If(msb(aux2)) {aux2<<=1; --exp;} uint23_t z_mantissa = uint48_46_24(aux2); #else //27 bit code uint27_t aux2 = mult27x27(aux2_x <<3, aux2_y <<3); If(msb(aux2)) {aux2>>=1; ++exp;} uint23_t z_mantissa = uint48_25_3(aux2); //discard always-1 MSB and use 23 bit MSB #endif Please consider this is nos tested, only general idea. Main thin is this: mantissa is a fixed point representations of 1.x values (one plus a number between 0 and 1). Using more bits only add precision to the results but positions remanin the same (msb bit is always 1, second msb represents 0.5, third 0.25...) El dom., 26 sep. 2021 00:12, Julian Kemmerer ***@***.***> escribió:

…

@suarezvictor <https://github.com/suarezvictor> I am running the tool with an interpretation of your code Can you comment on my implementation? https://github.com/JulianKemmerer/PipelineC/blob/40ad626b1ec649aebe317dda2fdfe18e606e2d42/primitives/cyclone_iv.c#L61 And do you know the proper way to use this 27b multiplier in the context of the 24b floating point mantissa? See old commented out code for 24x24b=48b mult https://github.com/JulianKemmerer/PipelineC/blob/40ad626b1ec649aebe317dda2fdfe18e606e2d42/primitives/cyclone_iv.c#L140 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACBHVWKRGCOUTG3SR6USCFLUD2FQ7ANCNFSM5EI2MBHQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

0 replies

suarezvictor · 2021-09-26T14:12:59Z

suarezvictor
Sep 26, 2021
Author

PS: sorry for the error ajusting exponents, be careful! El dom., 26 sep. 2021 11:11, Victor Suarez Rovere ***@***.***> escribió:

…

To use 27 bit multiplier you should use same technique as with 48 bit code: only use the 24 most significant bits of results. And obviously when passing 27 bit arguments, do a 3 bit shift left to such 24 bit arguments. uint24_t aux2_x = uint1_uint23(1, x_mantissa); uint24_t aux2_y = uint1_uint23(1, y_mantissa); #if 0 // previous 48 bit code uint48_t aux2 = LPM_MULT24X24(aux2_x, aux2_y); If(msb(aux2)) {aux2<<=1; --exp;} uint23_t z_mantissa = uint48_46_24(aux2); #else //27 bit code uint27_t aux2 = mult27x27(aux2_x <<3, aux2_y <<3); If(msb(aux2)) {aux2>>=1; ++exp;} uint23_t z_mantissa = uint48_25_3(aux2); //discard always-1 MSB and use 23 bit MSB #endif Please consider this is nos tested, only general idea. Main thin is this: mantissa is a fixed point representations of 1.x values (one plus a number between 0 and 1). Using more bits only add precision to the results but positions remanin the same (msb bit is always 1, second msb represents 0.5, third 0.25...) El dom., 26 sep. 2021 00:12, Julian Kemmerer ***@***.***> escribió: > @suarezvictor <https://github.com/suarezvictor> I am running the tool > with an interpretation of your code > > Can you comment on my implementation? > > https://github.com/JulianKemmerer/PipelineC/blob/40ad626b1ec649aebe317dda2fdfe18e606e2d42/primitives/cyclone_iv.c#L61 > > And do you know the proper way to use this 27b multiplier in the context > of the 24b floating point mantissa? > See old commented out code for 24x24b=48b mult > > https://github.com/JulianKemmerer/PipelineC/blob/40ad626b1ec649aebe317dda2fdfe18e606e2d42/primitives/cyclone_iv.c#L140 > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#14 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ACBHVWKRGCOUTG3SR6USCFLUD2FQ7ANCNFSM5EI2MBHQ> > . > Triage notifications on the go with GitHub Mobile for iOS > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> > or Android > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. > >

0 replies

suarezvictor · 2021-09-27T13:12:01Z

suarezvictor
Sep 27, 2021
Author

Good results achieved! Using the optimized 27x27 multiplier functions seems to achieve next to optimum performance: 326 Mhz out of the theoretical 340:


 llvm_dis_Z11float_rsqrtf Clock Goal: 1000.00 (MHz) Current: 293.77 (MHz)(3.40 ns) 92 clks
*llvm_dis_Z11float_rsqrtf Clock Goal: 1000.00 (MHz) Current: 326.58 (MHz)(3.06 ns) 93 clks
 llvm_dis_Z11float_rsqrtf Clock Goal: 1000.00 (MHz) Current: 312.79 (MHz)(3.20 ns) 95 clks
 ...
 llvm_dis_Z11float_rsqrtf Clock Goal: 1000.00 (MHz) Current: 297.35 (MHz)(3.36 ns) 142 clks
 Same or worse timing result... (best=326.5839320705422)

Now it seems it's time to make a coorect implementation (using just 24 bits of the 27 in output and input) and correctly adjusting the exponent

2 replies

JulianKemmerer Sep 28, 2021
Maintainer

Sounds good I will get back to ya when I try to synthesize your above example code

JulianKemmerer Sep 28, 2021
Maintainer

Also @suarezvictor do you have a reference for if ~93 clocks for a rsqrtf on a cyclone iv sounds reasonable? Ive never done these fp math calcs on FPGA before

suarezvictor · 2021-09-28T01:40:49Z

suarezvictor
Sep 28, 2021
Author

I'm un same situation. I'm interested on "results per seconds" metrics, and area usage (since it is posible to trade one for the other) and not on latency (we are below a microseconds). I've seen no other implementation of rsqrt on any FPGA and doubt it can be made faster than we are trying. El lun., 27 sep. 2021 22:32, Julian Kemmerer ***@***.***> escribió:

…

Also @suarezvictor <https://github.com/suarezvictor> do you have a reference for if ~93 clocks for a rsqrtf on a cyclone iv sounds reasonable? Ive never done these fp math calcs on FPGA before — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACBHVWPCR5U7W6FZW2UNIR3UEELLBANCNFSM5EI2MBHQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

0 replies

suarezvictor · 2021-09-28T01:43:22Z

suarezvictor
Sep 28, 2021
Author

Did you have opportunity to check results of the optimized fp32 multiplication? I wasent able to se how the native multiplier is instanced (no reference to the * operator un the generated VHDL). Hopefully this week I can do tests with cxxrtl: compiler and simulator results should match El lun., 27 sep. 2021 22:40, Victor Suarez Rovere ***@***.***> escribió:

…

I'm un same situation. I'm interested on "results per seconds" metrics, and area usage (since it is posible to trade one for the other) and not on latency (we are below a microseconds). I've seen no other implementation of rsqrt on any FPGA and doubt it can be made faster than we are trying. El lun., 27 sep. 2021 22:32, Julian Kemmerer ***@***.***> escribió: > Also @suarezvictor <https://github.com/suarezvictor> do you have a > reference for if ~93 clocks for a rsqrtf on a cyclone iv sounds reasonable? > Ive never done these fp math calcs on FPGA before > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#14 (reply in thread)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ACBHVWPCR5U7W6FZW2UNIR3UEELLBANCNFSM5EI2MBHQ> > . > Triage notifications on the go with GitHub Mobile for iOS > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> > or Android > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. > >

1 reply

JulianKemmerer Sep 28, 2021
Maintainer

I have not run those tests. I think next is like you said - get some inital tests going. Should start verifying basic operators first. Build up to more complicated modules.
One we get down to final rsqrtf code we know works and have passing tests for - its easier to start swapping out components and trying optimizations (and re-testing as needed).

JulianKemmerer · 2021-09-28T02:18:22Z

JulianKemmerer
Sep 28, 2021
Maintainer

Similar to how FP operations are implemented as PipelineC functions - so is the basic binary multiply algorithm (a bunch of shifts+adds). So since that is the first level of 'software' I think starting basic tests from there is smart. Verify uint mult, then fp mult+sub, then rsqrtf

0 replies

suarezvictor · 2021-09-28T03:48:01Z

suarezvictor
Sep 28, 2021
Author

Good to know our plan is clear El lun., 27 sep. 2021 23:18, Julian Kemmerer ***@***.***> escribió:

…

Similar to how FP operations are implemented as PipelineC functions - so is the basic binary multiply algorithm (a bunch of shifts+adds). So since that is the first level of 'software' I think starting basic tests from there is smart. Verify uint mult, then fp mult+sub, then rsqrtf — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#14 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACBHVWNPPM33QZH2NAE345TUEEQXTANCNFSM5EI2MBHQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

0 replies

suarezvictor · 2022-12-08T11:13:54Z

suarezvictor
Dec 8, 2022
Author

This code was generated by the chatGPT IA.
Prompt was "Implement a C function that implement 32 bit multiplication using just products of 8 bit terms"

#include <stdint.h>

uint32_t multiply_32_bit_with_8_bit_products(uint32_t x, uint32_t y) {
// Split the 32-bit numbers into 8-bit pieces
uint8_t x0 = x & 0xff;
uint8_t x1 = (x >> 8) & 0xff;
uint8_t x2 = (x >> 16) & 0xff;
uint8_t x3 = (x >> 24) & 0xff;

uint8_t y0 = y & 0xff;
uint8_t y1 = (y >> 8) & 0xff;
uint8_t y2 = (y >> 16) & 0xff;
uint8_t y3 = (y >> 24) & 0xff;

// Calculate the 8-bit products
uint16_t p0 = x0 * y0;
uint16_t p1 = x1 * y0;
uint16_t p2 = x2 * y0;
uint16_t p3 = x3 * y0;
uint16_t p4 = x0 * y1;
uint16_t p5 = x1 * y1;
uint16_t p6 = x2 * y1;
uint16_t p7 = x3 * y1;
uint16_t p8 = x0 * y2;
uint16_t p9 = x1 * y2;
uint16_t p10 = x2 * y2;
uint16_t p11 = x3 * y2;
uint16_t p12 = x0 * y3;
uint16_t p13 = x1 * y3;
uint16_t p14 = x2 * y3;
uint16_t p15 = x3 * y3;

// Calculate the intermediate sums
uint32_t s1 = p0 + (p4 << 8);
uint32_t s2 = p1 + (p5 << 8) + (p8 << 16);
uint32_t s3 = p2 + (p6 << 8) + (p9 << 16) + (p12 << 24);
uint32_t s4 = p3 + (p7 << 8) + (p10 << 16) + (p13 << 24);
uint32_t s5 = (p11 << 8) + (p14 << 16) + (p15 << 24);

// Calculate the final result
return s1 + (s2 << 16) + (s3 << 32) + (s4 << 48) + (s5 << 64);

}

0 replies

LLVM #14

suarezvictor Sep 18, 2021

Replies: 34 comments · 12 replies

JulianKemmerer Sep 20, 2021 Maintainer

suarezvictor Sep 21, 2021 Author

JulianKemmerer Sep 23, 2021 Maintainer

suarezvictor Sep 21, 2021 Author

suarezvictor Sep 23, 2021 Author

JulianKemmerer Sep 23, 2021 Maintainer

JulianKemmerer Sep 23, 2021 Maintainer

suarezvictor Sep 23, 2021 Author

suarezvictor Sep 23, 2021 Author

suarezvictor Sep 23, 2021 Author

suarezvictor Sep 23, 2021 Author

JulianKemmerer Sep 23, 2021 Maintainer

suarezvictor Sep 23, 2021 Author

JulianKemmerer Sep 24, 2021 Maintainer

JulianKemmerer Sep 24, 2021 Maintainer

JulianKemmerer Sep 24, 2021 Maintainer

JulianKemmerer Sep 25, 2021 Maintainer

suarezvictor Sep 25, 2021 Author

JulianKemmerer Sep 25, 2021 Maintainer

JulianKemmerer Sep 25, 2021 Maintainer

suarezvictor Sep 25, 2021 Author

suarezvictor Sep 25, 2021 Author

suarezvictor Sep 25, 2021 Author

JulianKemmerer Sep 26, 2021 Maintainer

suarezvictor Sep 25, 2021 Author

suarezvictor Sep 26, 2021 Author

suarezvictor Sep 26, 2021 Author

suarezvictor Sep 26, 2021 Author

JulianKemmerer Sep 26, 2021 Maintainer

JulianKemmerer Sep 26, 2021 Maintainer

suarezvictor Sep 26, 2021 Author

suarezvictor Sep 26, 2021 Author

suarezvictor Sep 27, 2021 Author

JulianKemmerer Sep 28, 2021 Maintainer

JulianKemmerer Sep 28, 2021 Maintainer

suarezvictor Sep 28, 2021 Author

suarezvictor Sep 28, 2021 Author

JulianKemmerer Sep 28, 2021 Maintainer

JulianKemmerer Sep 28, 2021 Maintainer

suarezvictor Sep 28, 2021 Author

suarezvictor Dec 8, 2022 Author

suarezvictor
Sep 18, 2021

Replies: 34 comments 12 replies

JulianKemmerer
Sep 20, 2021
Maintainer

suarezvictor
Sep 21, 2021
Author

JulianKemmerer Sep 23, 2021
Maintainer

suarezvictor
Sep 21, 2021
Author

suarezvictor
Sep 23, 2021
Author

JulianKemmerer Sep 23, 2021
Maintainer

JulianKemmerer Sep 23, 2021
Maintainer

suarezvictor
Sep 23, 2021
Author

suarezvictor
Sep 23, 2021
Author

suarezvictor
Sep 23, 2021
Author

suarezvictor
Sep 23, 2021
Author

JulianKemmerer
Sep 23, 2021
Maintainer

suarezvictor Sep 23, 2021
Author

JulianKemmerer
Sep 24, 2021
Maintainer

JulianKemmerer
Sep 24, 2021
Maintainer

JulianKemmerer
Sep 24, 2021
Maintainer

JulianKemmerer
Sep 25, 2021
Maintainer

suarezvictor
Sep 25, 2021
Author

JulianKemmerer Sep 25, 2021
Maintainer

JulianKemmerer Sep 25, 2021
Maintainer

suarezvictor
Sep 25, 2021
Author

suarezvictor
Sep 25, 2021
Author

suarezvictor
Sep 25, 2021
Author

JulianKemmerer Sep 26, 2021
Maintainer

suarezvictor
Sep 25, 2021
Author

suarezvictor
Sep 26, 2021
Author

suarezvictor
Sep 26, 2021
Author

suarezvictor
Sep 26, 2021
Author

JulianKemmerer Sep 26, 2021
Maintainer

JulianKemmerer
Sep 26, 2021
Maintainer

suarezvictor
Sep 26, 2021
Author

suarezvictor
Sep 26, 2021
Author

suarezvictor
Sep 27, 2021
Author

JulianKemmerer Sep 28, 2021
Maintainer

JulianKemmerer Sep 28, 2021
Maintainer

suarezvictor
Sep 28, 2021
Author

suarezvictor
Sep 28, 2021
Author

JulianKemmerer Sep 28, 2021
Maintainer

JulianKemmerer
Sep 28, 2021
Maintainer

suarezvictor
Sep 28, 2021
Author

suarezvictor
Dec 8, 2022
Author