Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes underflow reported by @hokb in PR #575 thanks to @thijssteel. #577

Closed

Conversation

weslleyspereira
Copy link
Collaborator

Closes #575.

This PR applies the solution from #575 (comment) to CLAHQR and ZLAHQR.

Thanks to @thijssteel!

@codecov
Copy link

codecov bot commented Jun 7, 2021

Codecov Report

Merging #577 (288d7cf) into master (53a49de) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #577   +/-   ##
=======================================
  Coverage   82.37%   82.37%           
=======================================
  Files        1894     1894           
  Lines      190679   190679           
=======================================
  Hits       157065   157065           
  Misses      33614    33614           
Impacted Files Coverage Δ
SRC/clahqr.f 95.75% <100.00%> (ø)
SRC/zlahqr.f 96.96% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53a49de...288d7cf. Read the comment docs.

Copy link

@hokb hokb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of ZLAHQR test ZEC runs fine with these changes on ifort. But here, it does not even enter lines 544 ... 548. On our compiler (where RTEMP underflows) it does recover successfully and completes this test iteration within zget37 (knt == 31). However, in subsequent iterations (knt == 49) the result from recovering from underflow in ZLAHQR causes subsequent steps in zget37 to fail (when computing condition numbers -> vmax becomes greater than the threshold). So, maybe the recovering is not yet sufficient or there is some other problem somewhere.
CLAHQR is also fun. Here, and when using 32 bit registers for ABS() it only partially underflows (only the real part when doing REAL(TEMP) * REAL(TEMP)). Hence, the outcome of ABS is wrong - but not RZERO. So, execution does not even enter line 543. We could test for both, the real part and the imaginary part individually. But somehow this path feels not optimal to me...

@weslleyspereira
Copy link
Collaborator Author

In the case of ZLAHQR test ZEC runs fine with these changes on ifort. But here, it does not even enter lines 544 ... 548. On our compiler (where RTEMP underflows) it does recover successfully and completes this test iteration within zget37 (knt == 31). However, in subsequent iterations (knt == 49) the result from recovering from underflow in ZLAHQR causes subsequent steps in zget37 to fail (when computing condition numbers -> vmax becomes greater than the threshold). So, maybe the recovering is not yet sufficient or there is some other problem somewhere.

Can you please indicate in which line the test fails? Is it inside ZTRSNA?

CLAHQR is also fun. Here, and when using 32 bit registers for ABS() it only partially underflows (only the real part when doing REAL(TEMP) * REAL(TEMP)). Hence, the outcome of ABS is wrong - but not RZERO. So, execution does not even enter line 543. We could test for both, the real part and the imaginary part individually. But somehow this path feels not optimal to me...

Yeah... I don't feel like it is optimal either. Another solution to this problem would be to use the BLAS routines SCNRM2 and DZNRM2 instead of ABS. These BLAS routines are guaranteed to not overflow or underflow. I think that, if we use them, we won't need to use the workaround proposed by @thijssteel in #575 (comment). Can you test CLAHQR with

            RTEMP = SCNRM2( TEMP )
            H( I, I-1 ) = RTEMP
            TEMP = TEMP / RTEMP

?

@hokb
Copy link

hokb commented Jun 8, 2021

In the case of ZLAHQR test ZEC runs fine with these changes on ifort. But here, it does not even enter lines 544 ... 548. On our compiler (where RTEMP underflows) it does recover successfully and completes this test iteration within zget37 (knt == 31). However, in subsequent iterations (knt == 49) the result from recovering from underflow in ZLAHQR causes subsequent steps in zget37 to fail (when computing condition numbers -> vmax becomes greater than the threshold). So, maybe the recovering is not yet sufficient or there is some other problem somewhere.

Can you please indicate in which line the test fails? Is it inside ZTRSNA?

It returns from ZTRSNA in line 408 with INFO = 0. But the comparison in line 416 fails for I = 4 (this test iteration is n = 5).

CLAHQR is also fun. Here, and when using 32 bit registers for ABS() it only partially underflows (only the real part when doing REAL(TEMP) * REAL(TEMP)). Hence, the outcome of ABS is wrong - but not RZERO. So, execution does not even enter line 543. We could test for both, the real part and the imaginary part individually. But somehow this path feels not optimal to me...

Yeah... I don't feel like it is optimal either. Another solution to this problem would be to use the BLAS routines SCNRM2 and DZNRM2 instead of ABS. These BLAS routines are guaranteed to not overflow or underflow. I think that, if we use them, we won't need to use the workaround proposed by @thijssteel in #575 (comment). Can you test CLAHQR with

            RTEMP = SCNRM2( TEMP )
            H( I, I-1 ) = RTEMP
            TEMP = TEMP / RTEMP

Bam! This makes all of CEC on XEIGTSTC pass. Genius!

    *
    *        Ensure that H(I,I-1) is real.
    *
             TEMP = H( I, I-1 )
             IF( AIMAG( TEMP ).NE.RZERO ) THEN
    *           OLD:  RTEMP = ABS( TEMP )
                RTEMP = SCNRM2(1, TEMP, 1)
                H( I, I-1 ) = RTEMP
                TEMP = TEMP / RTEMP
                IF( I2.GT.I )

@hokb
Copy link

hokb commented Jun 8, 2021

Tried DZNRM2 in ZLAHQR. But ZEC tests still fail at the same spot as before. It might be an unrelated issue, though. I figured that s and stmp contain NAN as element 4 and 5... I will try to track it down and report back.

@langou
Copy link
Contributor

langou commented Jun 8, 2021

Maybe instead of

RTEMP = SCNRM2(1, TEMP, 1)

Something like

RTEMP = SLAPY2( REAL(TEMP), AIMAG(TEMP) )

Ditto for D/Z.

Note that SLAPY2 should not be better in accuracy than SCNRM2. The suggestion is just a matter of preference. For a complex vector of length 1, I would prefer to call SLAPY2 than SCNRM2.

To be fair, I would prefer to call FORTRAN INTRINSIC ABS. It is frustrating that it seems that FORTRAN INTRINSIC ABS is the problem and is unnecessarily underflowing.

@weslleyspereira
Copy link
Collaborator Author

Tried DZNRM2 in ZLAHQR. But ZEC tests still fail at the same spot as before. It might be an unrelated issue, though. I figured that s and stmp contain NAN as element 4 and 5... I will try to track it down and report back.

Thank you! It will be very helpful if you can track the bug.

@hokb
Copy link

hokb commented Jun 8, 2021

Tried DZNRM2 in ZLAHQR. But ZEC tests still fail at the same spot as before. It might be an unrelated issue, though. I figured that s and stmp contain NAN as element 4 and 5... I will try to track it down and report back.

The problem here is a new issue, caused by ZTRSV overflowing in complex division in line 302:

                          IF (NOUNIT) TEMP = TEMP/DCONJG(A(J,J))   ! <--  gives -Infinity

TEMP is: (-3.0390314364021117e-160, 3.1073592763072852e-160)
A(J,J) is: (-1.2187537140200940e-160, 2.9549326388650510e-160)

Note, we have been very careful to implement complex division in exactly the same (algorithmic) way as is used in ifort. (It is actually the most straight - forward way.) So I am pretty sure, that this would overflow in FORTRAN as well, if the same numbers would be provided and the same register size used.
My guts feeling is that it will be rather hard to blame any specific part / technology here. FORTRAN does not specify how exactly things like LOG, SQRT etc. are to be implemented. So it feels natural that we see slightly different values - especially on complex numbers.
Any thoughts ?

@hokb
Copy link

hokb commented Jun 8, 2021

Maybe instead of

RTEMP = SCNRM2(1, TEMP, 1)

Something like

RTEMP = SLAPY2( REAL(TEMP), AIMAG(TEMP) )

Ditto for D/Z.

Note that SLAPY2 should not be better in accuracy than SCNRM2. The suggestion is just a matter of preference. For a complex vector of length 1, I would prefer to call SLAPY2 than SCNRM2.

I will give this a try and report back (tomorrow).

To be fair, I would prefer to call FORTRAN INTRINSIC ABS. It is frustrating that it seems that FORTRAN INTRINSIC ABS is the problem and is unnecessarily underflowing.

As stated above, I am struggling to clearly identify the problem. Neither ABS does something wrong nor the processor. But the only (straight forward) way to make everything succeed seems to be to adopt the exact same things as [name of your popular FORTRAN compiler] does ? (which is not our goal)

@weslleyspereira
Copy link
Collaborator Author

Tried DZNRM2 in ZLAHQR. But ZEC tests still fail at the same spot as before. It might be an unrelated issue, though. I figured that s and stmp contain NAN as element 4 and 5... I will try to track it down and report back.

The problem here is a new issue, caused by ZTRSV overflowing in complex division in line 302:

                          IF (NOUNIT) TEMP = TEMP/DCONJG(A(J,J))   ! <--  gives -Infinity

TEMP is: (-3.0390314364021117e-160, 3.1073592763072852e-160)
A(J,J) is: (-1.2187537140200940e-160, 2.9549326388650510e-160)

Note, we have been very careful to implement complex division in exactly the same (algorithmic) way as is used in ifort. (It is actually the most straight - forward way.) So I am pretty sure, that this would overflow in FORTRAN as well, if the same numbers would be provided and the same register size used.
My guts feeling is that it will be rather hard to blame any specific part / technology here. FORTRAN does not specify how exactly things like LOG, SQRT etc. are to be implemented. So it feels natural that we see slightly different values - especially on complex numbers.
Any thoughts ?

I would suggest that you try using ZLADIV( X, Y ), which computes X/Y without overflowing. But ZLADIV is from LAPACK, and ZTRSV is from BLAS, and BLAS shouldn't rely on LAPACK. Maybe you would be interested in comparing your implementation of complex division with the one in ZLADIV (which relies on DLADIV).

@langou
Copy link
Contributor

langou commented Jun 8, 2021

What about, in ZLAHQR, using ZLATRS instead of ZTRSV? (ZLATRS uses ZLADIV and does scaling to prevent overflow.)

@hokb
Copy link

hokb commented Jun 9, 2021

What about, in ZLAHQR, using ZLATRS instead of ZTRSV? (ZLATRS uses ZLADIV and does scaling to prevent overflow.)

This sounds tempting. There are multiple complex divisions in ZTRSV which potentially overflow. But ZTRSV is called by ZLATRS... Replacing our simple complex division with ZLADIV as suggested by @weslleyspereira does remove the overflow. Would this be an acceptable solution for you? To use ZLADIV in ZTRSV? (I will prepare an PR)
@weslleyspereira we had considered a more sophisticated /robust complex division as default implementation. But the current, simple version would be preferred, due to its general better compatibility with other FORTRAN compilers, performance, and to retain its rather simple semantics as a low-level operation.

@langou
Copy link
Contributor

langou commented Jun 9, 2021

Replacing our simple complex division with ZLADIV as suggested by @weslleyspereira does remove the overflow.

Good. Thanks for checking this.

But ZTRSV is called by ZLATRS

This is correct however ZLATRS is far more than a simple call to ZTRSV. In short ZLATRS calls ZTRSV when it thinks it is overflow-safe to do so. Otherwise it does the operations using ZLADIV. Since there is a significant performance difference between ZTRSV and ZLADIV, the goal is to use ZTRSV whenever possible. However I need to reread the code to know what ZLATRS think is safe. I think its range of safe is much wider than what is working on .NET. So indeed ZLATRS might need be helpful. It would be good to know whether vanilla ZLATRS does the trick or not though. I assume vanilla ZLATRS (instead of ZTRSV) is not working.

To repeat if, using .NET, we have Z1 = ( A1=1.0e-160, B1=1.0e-160 ) and Z2 = ( A2=1.0e-160, B2=1.0e-160 ) and then we do the operation Z1 / Z2 then we have an overflow. Is this correct?

Is the complex division algorithm used explicitly forming ( A2 **2 + B2 **2 ) by using a formula such as ( A1 + B1 ) * ( A2 - B2 ) / ( A2 **2 + B2 **2 )? (And then yes A2 **2 + B2 **2 underflows to 0 and then division by 0.)

I would like to understand better what should be expected from a compiler here. I was naively expecting Z1 / Z2 to be correctly computed as ( 1.0e+00, 0.0e+00 ).

@hokb: Can you give use some "simple" overflow/underflow that you see?
For Z1 = ( A1=2.0e-160, B1=2.0e-160 ), is ABS( Z1 ) = 0.0e+00?
For Z1 = ( A1=2.0e+160, B1=2.0e+160 ), is ABS( Z1 ) = Inf?
For Z1 = ( A1=1.0e-160, B1=1.0e-160 ) and Z2 = ( A2=1.0e-160, B2=1.0e-160 ), is Z1 / Z2 = Inf?
For Z1 = ( A1=1.0e+160, B1=1.0e+160 ) and Z2 = ( A2=1.0e+160, B2=1.0e+160 ), is Z1 / Z2 = 0?

I was naively expecting all these computation to be done correctly.

Replacing our simple complex division with ZLADIV as suggested by @weslleyspereira does remove the overflow. Would this be an acceptable solution for you? To use ZLADIV in ZTRSV? (I will prepare an PR)

I would wait a little.

@hokb
Copy link

hokb commented Jun 9, 2021

To repeat if, using .NET, we have Z1 = ( A1=1.0e-160, B1=1.0e-160 ) and Z2 = ( A2=1.0e-160, B2=1.0e-160 ) and then we do the operation Z1 / Z2 then we have an overflow. Is this correct?

It of course depends on how complex division is implemented. The naive formula (as used by ifort btw) does overflow on 64 bit registers. There is only one 'official' .NET Complex datatype. This uses Smith's formula and does not overflow, even when using SSE registers. Same thing with our (ILNumerics) complex and fcomplex datatypes. They use a similar algorithm which does not overflow. But for our FORTRAN compiler we try to mimic the behavior of common FORTRAN code as closely as possible. Hence, we implemented the naive formula (as in the following paragraph).

Is the complex division algorithm used explicitly forming ( A2 **2 + B2 **2 ) by using a formula such as ( A1 + B1 ) * ( A2 - B2 ) / ( A2 **2 + B2 **2 )? (And then yes A2 **2 + B2 **2 underflows to 0 and then division by 0.)

Either that or overflow happens in the division 1.0 / 1.99997773436537E-320.

I would like to understand better what should be expected from a compiler here. I was naively expecting Z1 / Z2 to be correctly computed as ( 1.0e+00, 0.0e+00 ).

This is exactly what I am trying to understand, too! I spent some time stepping through machine instructions generated by ifort. And I was surprised to see that it uses the naive (prone to overflow) formula for complex division. On the other hand, it does not overflow on above example. But this is only due to the 80 bit register used ...

@hokb: Can you give use some "simple" overflow/underflow that you see?
For Z1 = ( A1=2.0e-160, B1=2.0e-160 ), is ABS( Z1 ) = 0.0e+00?

Math.Sqrt(Z1.real * Z1.real + Z1.imag * Z1.imag), using SSE (64 bit) for * gives: 2.8284113805211334e-160

For Z1 = ( A1=2.0e+160, B1=2.0e+160 ), is ABS( Z1 ) = Inf?

=> gives: Positive Infinity

For Z1 = ( A1=1.0e-160, B1=1.0e-160 ) and Z2 = ( A2=1.0e-160, B2=1.0e-160 ), is Z1 / Z2 = Inf?

=> naive formula for Z1 / Z2, using SSE registers for *, gives: (∞+ NaN)

For Z1 = ( A1=1.0e+160, B1=1.0e+160 ) and Z2 = ( A2=1.0e+160, B2=1.0e+160 ), is Z1 / Z2 = 0?

=> naive formula, divisor becomes Infinity, gives (NaN+ NaN).

I was naively expecting all these computation to be done correctly.

We could commit to this expectation and use more robust algorithms. The only thing which puzzles me is to see that in the case of ifort (at these specific places) robustness seems to be realized by the use of larger registers. Since, as you know, .NET uses a JIT we cannot guarantee that eventually a specific register is used. The only guarantee is IEEE754. (I realize that this sounds like an argument for using more robust algorithms on our side...)

Replacing our simple complex division with ZLADIV as suggested by @weslleyspereira does remove the overflow. Would this be an acceptable solution for you? To use ZLADIV in ZTRSV? (I will prepare an PR)

I would wait a little.

ok

@langou
Copy link
Contributor

langou commented Jun 9, 2021

(1) I am sorry, I made a mistake. I should have written 1e-200 and not 1e-160. 1e-160 will go in the denormalized region. You can see that the results for Z1 = ( A1=2.0e-200, B1=2.0e-200 ); ABS( Z1 ) actually does not give 2*sqrt(2) very accurately. Here is 2*sqrt(2) and below the result

2.8284113805211333 
2.8284271247461903

So we do lose quite some precision by using in denormalized. (As expected.) Anyhow. Let me ask again. "For Z1 = ( A1=1.0e-200, B1=1.0e-200 ), is ABS( Z1 ) = 0.0e+00?"

(2) I do see how 80-bit can help here. Sure. 15 bits in the exponent for 80-bit registers (instead of 11 bits for 64-bit) will make quite a difference. With 15 bits in the exponent, the highest number is about 1e+4932 and the smallest positive about 1e-4931. Sure. That'll help with overflow and underflow. No need to use an "avoid unnecessary underflow/overflow" algorithm :). Most recent Intel machines do not have 80-bit registers. Do they? I think 80-bit register is a thing of the past. Are you sure that the improved accuracy that you see with ifort is coming from ifort using 80-bit register, as opposed to ifort using a more stable algorithm? I am sorry. I thought 80-bit registers were a thing of the past.

(3) I need to review quite a few things here. What does ZLADIV do? What does ZLATRS do? When is ZLATRS used in LAPACK (instead of ZTRSV)? What does the Fortran standard say for complex division and ABS()? What do Fortran compilers do for complex division and ABS()? How much of a slowdown is it to use ZLATRS (as opposed to ZTRSV). I am not up to speed on these topics.

Overall, I think .NET should use more robust algorithm for complex division and complex ABS().

I do not think ZLATRS is meant to cover this usage. So while ZLATRS might help, I do not think this will be bullet proof. In addition, I think the slowdown to go from ZLATRS to ZTRSV is too much.

I do not think that changing ZTRSV is a good idea to use ZLADIV(). I think it is OK that ZTRSV relies on complex division to not unnecessarily underflow / overflow. If we go that road, we would rewrite all complex divisions in LAPACK and all ABS() with our own functions, and I do not think we want to go that road.

I am open to discussion though. We can speak more.

@hokb
Copy link

hokb commented Jun 11, 2021

(1) I am sorry, I made a mistake. I should have written 1e-200 and not 1e-160. 1e-160 will go in the denormalized region. You can see that the results for Z1 = ( A1=2.0e-200, B1=2.0e-200 ); ABS( Z1 ) actually does not give 2*sqrt(2) very accurately. Here is 2*sqrt(2) and below the result

2.8284113805211333 
2.8284271247461903

So we do lose quite some precision by using in denormalized. (As expected.) Anyhow. Let me ask again. "For Z1 = ( A1=1.0e-200, B1=1.0e-200 ), is ABS( Z1 ) = 0.0e+00?"

Naive attempt: 0.0e+0.
Robust implementation: 1.4142135623730950e-200

(2) I do see how 80-bit can help here. Sure. 15 bits in the exponent for 80-bit registers (instead of 11 bits for 64-bit) will make quite a difference. With 15 bits in the exponent, the highest number is about 1e+4932 and the smallest positive about 1e-4931. Sure. That'll help with overflow and underflow. No need to use an "avoid unnecessary underflow/overflow" algorithm :). Most

Right, a valid and important point!

recent Intel machines do not have 80-bit registers. Do they? I think 80-bit register is a thing of the past. Are you sure that the improved accuracy that you see with ifort is coming from ifort using 80-bit register, as opposed to ifort using a more stable algorithm? I am sorry. I thought 80-bit registers were a thing of the past.

It looks like ifort does still make use of X87 instructions. (Haven't tried with their latest FORTRAN compiler, wich is released within Intels oneAPI, though)

       RTEMP = ABS( TEMP)
00007FF7466116E6 66 0F 10 85 60 04 00 00 movupd      xmm0,xmmword ptr [TEMP]  
00007FF7466116EE 66 0F 28 C8          movapd      xmm1,xmm0  
00007FF7466116F2 F2 0F 11 8D 10 05 00 00 movsd       mmword ptr [rbp+510h],xmm1  
00007FF7466116FA DD 85 10 05 00 00    fld         qword ptr [rbp+510h]  
00007FF746611700 D8 C8                fmul        st,st(0)  
00007FF746611702 66 0F 15 C0          unpckhpd    xmm0,xmm0  
00007FF746611706 F2 0F 11 85 10 05 00 00 movsd       mmword ptr [rbp+510h],xmm0  
00007FF74661170E DD 85 10 05 00 00    fld         qword ptr [rbp+510h]  
00007FF746611714 D8 C8                fmul        st,st(0)  
00007FF746611716 DE C1                faddp       st(1),st  
00007FF746611718 D9 FA                fsqrt  
00007FF74661171A DD 9D 70 04 00 00    fstp        qword ptr [RTEMP]  

(3) I need to review quite a few things here. What does ZLADIV do? What does ZLATRS do? When is ZLATRS used in LAPACK (instead of ZTRSV)? What does the Fortran standard say for complex division and ABS()? What do Fortran compilers do for complex division and ABS()? How much of a slowdown is it to use ZLATRS (as opposed to ZTRSV). I am not up to speed on these topics.

Overall, I think .NET should use more robust algorithm for complex division and complex ABS().

I do not think ZLATRS is meant to cover this usage. So while ZLATRS might help, I do not think this will be bullet proof. In addition, I think the slowdown to go from ZLATRS to ZTRSV is too much.

I do not think that changing ZTRSV is a good idea to use ZLADIV(). I think it is OK that ZTRSV relies on complex division to not unnecessarily underflow / overflow. If we go that road, we would rewrite all complex divisions in LAPACK and all ABS() with our own functions, and I do not think we want to go that road.

These are all very valuable statements. My take-away is that we will have to implement the more robust version of basic complex number handling. I was not sure about that and tried to identify the best place to put this responsibility. It is very reasonable now.
With the robust division and robust ABS implemented in our complex types the failing LAPACK tests are now reduced by ~50%. All of ZEC and CEC pass. We tried the original code - without the changes in this PO. So, we don't have to modify CLAHQR nor ZLAHQR. Thanks also to @weslleyspereira and @thijssteel for coming up with these suggestions!

@julielangou julielangou added this to the LAPACK 3.10.0 milestone Nov 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Processor requirements for LAPACK
4 participants