LLVM #14
Replies: 34 comments 12 replies
-
Oh @suarezvictor I didn't really get to look at this in detail until now. Not sure what running through LLVM was about? For optimizations, eh? We would have to define/replace whatever the built-in LOAD and casting functions are doing... If you are wanting to write custom floating point functions (e.g. rsqrt), that is definitely doable. In fact the FP operators themselves (multiply, add, etc.) are implemented as PipelineC code you could base future work off of:

float BIN_OP_MULT_float_float_float(float left, float right)
{
// Get mantissa exponent and sign for both
// LEFT
uint23_t x_mantissa;
x_mantissa = float_22_0(left);
uint9_t x_exponent_wide;
x_exponent_wide = float_30_23(left);
uint1_t x_sign;
x_sign = float_31_31(left);
// RIGHT
uint23_t y_mantissa;
y_mantissa = float_22_0(right);
uint9_t y_exponent_wide;
y_exponent_wide = float_30_23(right);
uint1_t y_sign;
y_sign = float_31_31(right);
// Declare the output portions
uint23_t z_mantissa;
uint8_t z_exponent;
uint1_t z_sign;
// Sign
z_sign = x_sign ^ y_sign;
// Multiplication with infinity = inf
if((x_exponent_wide==255) | (y_exponent_wide==255))
{
z_exponent = 255;
z_mantissa = 0;
}
// Multiplication with zero = zero
else if((x_exponent_wide==0) | (y_exponent_wide==0))
{
z_exponent = 0;
z_mantissa = 0;
z_sign = 0;
}
// Normal non zero|inf mult
else
{
// Declare intermediates
uint1_t aux;
uint24_t aux2_x;
uint24_t aux2_y;
uint48_t aux2;
uint7_t BIAS;
BIAS = 127;
aux2_x = uint1_uint23(1, x_mantissa);
aux2_y = uint1_uint23(1, y_mantissa);
aux2 = aux2_x * aux2_y;
// args in Q23 result in Q46
aux = uint48_47_47(aux2);
if(aux) //if(aux == 1)
{
// >=2, shift left and add one to exponent
// HACKY NO ROUNDING + aux2(23); // with round18
z_mantissa = uint48_46_24(aux2);
}
else
{
// HACKY NO ROUNDING + aux2(22); // with rounding
z_mantissa = uint48_45_23(aux2);
}
// calculate exponent in parts
// do sequential unsigned adds and subs to avoid signed numbers for now
// X and Y exponent are already 1 bit wider than needed
// (0 & x_exponent) + (0 & y_exponent);
uint9_t exponent_sum = x_exponent_wide + y_exponent_wide;
exponent_sum = exponent_sum + aux;
exponent_sum = exponent_sum - BIAS;
// HACKY NOT CHECKING
// if (exponent_sum(8)='1') then
z_exponent = uint9_7_0(exponent_sum);
}
// Assemble output
return float_uint1_uint8_uint23(z_sign, z_exponent, z_mantissa);
}

Currently that code is internally generated and used. You'll see that it doesn't use unions and bitfields, which are complicated when, as in hardware, you have no traditional 'layout of bytes in memory' that folks might expect. Instead it uses some built-in bit manipulation functions: https://github.com/JulianKemmerer/PipelineC/wiki/Automatically-Generated-Functionality#bitmanip
Let me know if you want any help getting Quartus running to synthesize some stuff, etc.
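For reference when testing, the same algorithm can be modeled on a host PC. Below is a hedged Python sketch of the bit manipulation above (the helper names here are my own, not PipelineC built-ins); like the PipelineC code it truncates instead of rounding, flushes zero/denormal exponents to zero, and skips exponent overflow checking:

```python
import struct

def f32_bits(f):
    # reinterpret a Python float as its 32-bit IEEE 754 pattern
    return struct.unpack('<I', struct.pack('<f', f))[0]

def bits_f32(b):
    # reinterpret a 32-bit pattern as a float
    return struct.unpack('<f', struct.pack('<I', b))[0]

def f32_mult(left, right):
    x, y = f32_bits(left), f32_bits(right)
    x_m, x_e, x_s = x & 0x7FFFFF, (x >> 23) & 0xFF, x >> 31
    y_m, y_e, y_s = y & 0x7FFFFF, (y >> 23) & 0xFF, y >> 31
    z_s = x_s ^ y_s
    if x_e == 255 or y_e == 255:       # multiplication with infinity = inf
        z_e, z_m = 255, 0
    elif x_e == 0 or y_e == 0:         # multiplication with zero = zero
        z_e, z_m, z_s = 0, 0, 0
    else:
        prod = ((1 << 23) | x_m) * ((1 << 23) | y_m)    # 24x24 -> 48 bits
        aux = prod >> 47                                # product >= 2.0?
        z_m = (prod >> (24 if aux else 23)) & 0x7FFFFF  # truncate, no rounding
        z_e = (x_e + y_e + aux - 127) & 0xFF            # HACKY: no overflow check
    return bits_f32((z_s << 31) | (z_e << 23) | z_m)
```

Products whose mantissas are exactly representable come out bit-identical to host float multiplication, which makes this usable as a cross-check for the generated hardware.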
-
I did the LLVM example to get an optimized version of a defined C algorithm, for example 1/sqrt as shown. My plan is to run it in PipelineC with CXXRTL (and count the number of cycles), and in that CXXRTL main.c file also run the original algorithm and compare that the results are equal. You may make that automatic when generating CXXRTL code. Another advantage of using such optimized CXXRTL code is that it imposes fewer demands on syntax for PipelineC (i.e. no need for unions).
The LOAD function just corresponds to a memory access, and the implementation both in C and PipelineC is to return the argument (identity function), or as a simple macro:
#define LOAD(a) (a)
It was left there for clarity.
Once such a proposed system of comparing the results in the simulator and in real C is in place, a complete math library can be built.
Want operations with complex numbers? Overloaded * operators? Quaternions? Implement it in C with templates or whatever you like: LLVM will convert it to simple binary or unary operations on floats, easy to process with PipelineC.
A tool I'm writing converts LLVM IR assembly output to simple (PipelineC compatible) C.
I can continue improving that tool (the next step is reuse of variables to lower their count) and you can help me with all the quirks of PipelineC, including passing it to Quartus.
-
Forgot to say that to build such a math library I can reuse code from existing C and C++ libraries without modification, and that would include their already written tests.
-
Register reuse optimization
-
Nice to know you are already doing interesting optimizations. Would you elaborate on how to estimate how many registers are used? As far as I understood, you need to make a matrix with the number of columns corresponding to the maximum register usage at a given time, and the rows corresponding to the latency.
-
(By latency I mean pipeline depth)
-
In regards to register reuse making "longer pipelines": I think latency is not very relevant most of the time; we want "results per second". As far as I understood, making the pipeline depth larger doesn't affect throughput, and avoiding parallelism saves resource usage. Am I understanding correctly? How can such a tradeoff be controlled with PipelineC? In the project I'm thinking of, I prefer to save resource usage; maybe reusing registers is the way.
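The intuition above can be put in numbers (my own sketch, not a PipelineC feature): for a fully pipelined design producing one result per clock once the pipe is full, throughput depends on fmax only, while depth affects latency alone.

```python
def pipeline_time(fmax_hz, depth, n_results):
    # time until the first result emerges (the latency)
    latency_s = depth / fmax_hz
    # fill the pipeline, then one result per cycle thereafter
    total_s = (depth + n_results - 1) / fmax_hz
    return latency_s, total_s
```

For a million results at 100 MHz, going from 10 to 50 pipeline stages multiplies latency by 5 but changes total runtime by well under 0.01%.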
-
For example I'd like to pass both codes (with register reuse and without) and see how resource usage and latency are affected.
-
@suarezvictor what FPGA are you targeting? Can you give me your full part string?
-
@suarezvictor I plan on reporting on more details but after fixing some Quartus things you could try it yourself too After seeing how the --coarse sweep goes you might be interested in supplying the --start and/or --sweep args (to limit/expand scope of latency sweep). And if you have an actual FMAX in mind you can try setting MAIN_MHZ pragma in the c code and running without --coarse. Though if the design is small you might get better results with --coarse. Lots of experimenting/tuning to try. More info to come of pipeline hardware details |
Beta Was this translation helpful? Give feedback.
-
@suarezvictor Do you have a Quartus paid license? (maybe dev board includes one?) I.e. Not the 'Lite' version of quartus?
Alternatively you can always try getting results not using DSPs - FPGA fabric only. What do you think about this situation? |
Beta Was this translation helpful? Give feedback.
-
Btw you can now switch between the styles of multiplier used globally with --mult infer or --mult fabric now (needed for internally generated float mult C code at the moment) |
Beta Was this translation helpful? Give feedback.
-
@suarezvictor I have essentially done #2/3 for you in some stuff inside https://github.com/JulianKemmerer/PipelineC/blob/master/primitives/cyclone_iv.c
We can talk long term about getting this 'built into the compiler' |
Beta Was this translation helpful? Give feedback.
-
@suarezvictor I have essentially done #2/3 for you in some stuff inside https://github.com/JulianKemmerer/PipelineC/blob/master/primitives/cyclone_iv.c
./src/pipelinec ./examples/llvm/rsqrtf.c --coarse should be a pretty good representation of the types of pipeline fmaxes and area to expect (though still some optimizations for fewer DSPs - Karatsuba algorithm, 1/4 fewer DSPs could be done it seems).
We can talk long term about getting this 'built into the compiler'.
-
Your advances are great. I'm trying to allocate a bit of time to test it all.
Some ideas for improvements are the following:
- Use 9x9 multipliers, which maybe use less area, take less time, and permit more register insertion
- Don't use a 24x24-to-48 multiplier function, since the mantissa is a fixed point representation and we don't need the 24 least significant bits (so no need to calculate them)
Let's say we need to multiply two mantissa values A and B, 24 bits each. Value A can be represented as 1.abc, with a, b and c being 8-bit values. B is 1.xyz, same width. Then you can multiply: A*B ≈ (a*x + a*y + a*z + b*x + b*y + c*x), that is, not calculating the least significant terms b*z + c*y + c*z. Using the full 9 bits instead of 8 will solve the carry issues from the least significant terms.
Hopefully you grasp the idea; if not, I can do an implementation.
I beg your patience until I finish some paid work and can get to this.
-
Oh, and I don't plan to use the paid versions of the tools; indeed my plan is to try to abandon such closed source tools as soon as possible, even if I need to use chips from other vendors like Lattice.
-
r = (XX*a + X*b + c) * (XX*d + X*e + f)
  = XXXX*ad + XXX*(ae+bd) + XX*(af+be+cd) + X*(bf+ce) + cf

assuming 9-bit numbers (X = 1<<9), and ignoring the least significant terms bf, ce, cf (i.e. truncating), and dividing by XX (a common shift of all numbers), then:

r' = r/XX = XX*ad + X*(ae+bd) + (af+be+cd)

note that the 9 least significant bits of cd, be, af can be ignored (they don't contribute to the truncated value), so a 27x27 bit multiplication with a 27 bit result becomes:

uint9_t cd = MSB_9(c * d);
uint18_t bd = b * d;
uint18_t ad = a * d;

this may work
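The scheme can be sanity-checked on a host. Below is a hedged Python sketch (function names are mine) that splits each 27-bit operand into three 9-bit limbs, drops the bf, ce, cf partial products as described, and checks the resulting truncation error against the exact product:

```python
X = 1 << 9  # weight of one 9-bit limb

def split27(v):
    # split a 27-bit value into three 9-bit limbs (a = most significant)
    return (v >> 18) & 0x1FF, (v >> 9) & 0x1FF, v & 0x1FF

def trunc_mult27(x, y):
    a, b, c = split27(x)
    d, e, f = split27(y)
    # r' = r/XX with the bf, ce and cf partial products dropped (truncated)
    return X * X * (a * d) + X * (a * e + b * d) + (a * f + b * e + c * d)

def exact_hi27(x, y):
    # exact full product scaled the same way: floor(r / XX)
    return (x * y) >> 18
```

The dropped terms contribute at most X*(bf+ce)+cf before the shift, so after dividing by XX the truncated result under-estimates the exact one by less than about 2^10 units in the last place of the 36-bit scaled product.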
-
In the case of Cyclone IV devices, fastest speed grade (C6 models), a 9x9-bit multiplier should reach 340 MHz and an 18x18-bit multiplier 287 MHz.
-
Idea: considering that 3 bits would be discarded from the 27, the first multipliers need to be of just 3x3 bits with 3-bit results (i.e. a 6-input LUT per output bit should be enough). That may be better implemented in fabric, and thus half of the multipliers are saved.
-
Below are logic equations to form a 3x3 bit fixed point multiplier using just logic.
Calculated with:
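The equations themselves did not survive the page formatting. As an illustration of how such equations can be derived (my own sketch, not the original post's), one can enumerate the 6-input truth table for each of the 3 output bits; a logic minimizer or the synthesizer turns the minterm lists into equations:

```python
def mult3x3_hi3(a, b):
    # 3x3-bit multiply keeping only the top 3 bits of the 6-bit product,
    # i.e. each output bit is a function of 6 inputs (one 6-input LUT each)
    return ((a * b) >> 3) & 0x7

# minterms[bit] lists the 6-bit input patterns {a:b} where that output bit is 1
minterms = {bit: [] for bit in range(3)}
for a in range(8):
    for b in range(8):
        p = mult3x3_hi3(a, b)
        for bit in range(3):
            if (p >> bit) & 1:
                minterms[bit].append((a << 3) | b)
```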
-
Sorry, I couldn't test it with actual hardware just because of lack of time. I'm eager to do all the tests regarding our discussions but I still have pending work.
On Sat., Sep 25, 2021, 21:55, Julian Kemmerer wrote:
> @suarezvictor Do you have a cyclone iv rsqrtf to compare to? fmax+resources? (Does it get near those max fmax values you list?)
-
I just posted the fmax of the multipliers as listed on the datasheet, so if the implementation reaches those values, there's nothing left to optimize.
-
@suarezvictor I am running the tool with an interpretation of your code. Can you comment on my implementation? (PipelineC/primitives/cyclone_iv.c, line 61, commit 40ad626)
And do you know the proper way to use this 27b multiplier in the context of the 24b floating point mantissa? See the old commented-out code for the 24x24b=48b mult. (PipelineC/primitives/cyclone_iv.c, line 140, commit 40ad626)
-
To use the 27 bit multiplier you should use the same technique as with the 48 bit code: only use the 24 most significant bits of the result. And obviously when passing 27 bit arguments, do a 3 bit left shift on such 24 bit arguments.

uint24_t aux2_x = uint1_uint23(1, x_mantissa);
uint24_t aux2_y = uint1_uint23(1, y_mantissa);
#if 0 // previous 48 bit code
uint48_t aux2 = LPM_MULT24X24(aux2_x, aux2_y);
if(msb(aux2)) {aux2<<=1; --exp;}
uint23_t z_mantissa = uint48_46_24(aux2);
#else // 27 bit code
uint27_t aux2 = mult27x27(aux2_x << 3, aux2_y << 3);
if(msb(aux2)) {aux2>>=1; ++exp;}
uint23_t z_mantissa = uint27_25_3(aux2); // discard the always-1 MSB and use the 23 MSBs
#endif

Please consider this is not tested, only the general idea.
The main thing is this: the mantissa is a fixed point representation of 1.x values (one plus a number between 0 and 1). Using more bits only adds precision to the result, but the positions remain the same (the msb is always 1, the second msb represents 0.5, the third 0.25...).
-
PS: sorry for the error adjusting exponents, be careful!
-
Good results achieved! Using the optimized 27x27 multiplier functions seems to achieve next-to-optimum performance: 326 MHz out of the theoretical 340.
Now it seems it's time to make a correct implementation (using just 24 of the 27 bits in the output and input) and correctly adjust the exponent.
-
On Mon., Sep 27, 2021, 22:32, Julian Kemmerer wrote:
> Also @suarezvictor do you have a reference for if ~93 clocks for a rsqrtf on a cyclone iv sounds reasonable? I've never done these fp math calcs on FPGA before

I'm in the same situation. I'm interested in "results per second" metrics and in area usage (since it is possible to trade one for the other), not in latency (we are below a microsecond). I've seen no other implementation of rsqrt on any FPGA and doubt it can be made faster than what we are trying.
-
Did you have an opportunity to check the results of the optimized fp32 multiplication? I wasn't able to see how the native multiplier is instanced (no reference to the * operator in the generated VHDL).
Hopefully this week I can do tests with CXXRTL: compiler and simulator results should match.
-
Similar to how the FP operations are implemented as PipelineC functions, so is the basic binary multiply algorithm (a bunch of shifts+adds). Since that is the first level of 'software', I think starting basic tests from there is smart. Verify uint mult, then fp mult+sub, then rsqrtf.
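For the proposed simulator-vs-real-C comparison, that first level of 'software' can be mirrored on the host. Here is a hedged Python sketch of a 24-bit shift-and-add unsigned multiply (my own illustration, not the actual PipelineC-generated code):

```python
def shift_add_mult24(x, y):
    # classic shift-and-add: one partial product per bit of y
    acc = 0
    for i in range(24):
        if (y >> i) & 1:
            acc += x << i   # add a shifted copy of x
    return acc              # up to 48 bits for 24-bit inputs
```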
-
Good to know our plan is clear
-
This code was generated by the ChatGPT AI:

#include <stdint.h>
uint32_t multiply_32_bit_with_8_bit_products(uint32_t x, uint32_t y)
{
}
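The generated body did not survive the page formatting. What the function name suggests can be sketched as follows (my own hypothetical reconstruction, not the ChatGPT output): split each 32-bit operand into four 8-bit limbs and accumulate the sixteen 8x8 partial products.

```python
def multiply_32_bit_with_8_bit_products(x, y):
    acc = 0
    for i in range(4):
        for j in range(4):
            xi = (x >> (8 * i)) & 0xFF
            yj = (y >> (8 * j)) & 0xFF
            acc += (xi * yj) << (8 * (i + j))  # 8x8 partial product, shifted into place
    return acc & 0xFFFFFFFF                     # keep the low 32 bits, like uint32_t
```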
-
I was able to compile a C function with LLVM (optimized output), then parse the assembly and convert it to C in a fashion compatible with PipelineC (indeed the generated code compiles). Example: a "fast" inverse square root function:
SOURCE:
float float_rsqrt( float number )
{
const float x2 = number * float_rsqrt_K0; //0.5f;
const float threehalfs = float_rsqrt_K1; //1.5f;
union {
float f;
uint32_t i;
} conv = { .f = number };
conv.i = float_rsqrt_K2 - ( conv.i >> 1 ); //0x5f3759df = 1597463007
conv.f *= threehalfs - ( x2 * conv.f * conv.f );
return conv.f;
}
AFTER PARSING LLVM IR OUTPUT AND GENERATING C
float llvm_dis_Z11float_rsqrtf( float a0)
{
float a2 = LOAD(llvm_dis_float_rsqrt_K0); // %2 = load float, float* @float_rsqrt_K0, align 4, !tbaa !3
float a3 = a2 * a0; // %3 = fmul float %2, %0
float a4 = LOAD(llvm_dis_float_rsqrt_K1); // %4 = load float, float* @float_rsqrt_K1, align 4, !tbaa !3
uint32_t a5 = BITCAST_I32(a0); // %5 = bitcast float %0 to i32
uint32_t a6 = LOAD(llvm_dis_float_rsqrt_K2); // %6 = load i32, i32* @float_rsqrt_K2, align 4, !tbaa !7
uint32_t a7 = a5 >> 1; // %7 = lshr i32 %5, 1
uint32_t a8 = a6 - a7; // %8 = sub i32 %6, %7
float a9 = BITCAST_FLOAT(a8); // %9 = bitcast i32 %8 to float
float a10 = a3 * a9; // %10 = fmul float %3, %9
float a11 = a10 * a9; // %11 = fmul float %10, %9
float a12 = a4 - a11; // %12 = fsub float %4, %11
float a13 = a12 * a9; // %13 = fmul float %12, %9
return a13; // ret float %13
}
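For the proposed compare-against-real-C testing, the same function can be modeled on the host. A Python sketch, assuming the constants in the source comments (K0=0.5f, K1=1.5f, K2=0x5f3759df); host doubles stand in for 32-bit floats here, which is close enough for a sanity check:

```python
import struct

def float_rsqrt(number):
    # fast inverse square root: bit-level initial guess + one Newton step
    x2 = number * 0.5
    i = struct.unpack('<I', struct.pack('<f', number))[0]
    i = 0x5f3759df - (i >> 1)                          # magic-constant guess
    f = struct.unpack('<f', struct.pack('<I', i))[0]
    f *= 1.5 - (x2 * f * f)                            # one Newton-Raphson iteration
    return f
```

With a single Newton step the result is within roughly 0.2% of 1/sqrt(x), which matches the accuracy usually cited for this trick.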
No luck yet simulating it with CXXRTL/GHDL or even synthesizing with Quartus...
It seems that PipelineC isn't smart enough to use fewer than the 14 variables (not all intermediate values reach the end of the pipeline), but that may be solved with another optimization pass in LLVM or done manually at the code generation phase.