Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

long double floating point calculations give inexact results #830

Closed
bewantbe opened this issue Aug 9, 2016 · 54 comments
Closed

long double floating point calculations give inexact results #830

bewantbe opened this issue Aug 9, 2016 · 54 comments

Comments

@bewantbe
Copy link

bewantbe commented Aug 9, 2016

This program

// g++ -std=c++11 -static -o m24_11 m24.cpp ; ./m24_11
int main()
{
  long double t = 1;
  printf("%.17Le\n", t / 100000);
}

outputs (Linux, expected result):
1.00000000000000000e-05

But in BashOnWindows, it outputs:
1.00000000000000008e-05

The long double data type in C/C++ under Linux using GCC for x86/x86-64 platform has an 80-bit format with 64 bits mantissa.
When a program using long double runs in BashOnWindows it behaves like having 53 bits mantissa number just like the usual "double" (but with almost the same exponential range as long double)

This harms functions like std::pow(long double, int) and hence boost::lexical_cast<double>(char*) (boost version 1.55) since internally it uses long double as intermediate result (but boost version 1.60 fixes this problem), therefore boost::program_options can reads double type options inaccurately also (even for exact input e.g. 0.03125 = 1/32).

Likewise this program:

// g++ -std=c++11 -static -o m25 m25.cpp ; ./m25
#include <stdio.h>
#include <limits>
int main()
{
  long double t0, t1 = 1.0;
  while (t1 > 0) {
    t0 = t1;
    t1 = t0 / 2;
  }
  long double lt = std::numeric_limits<long double>::denorm_min();
  if (t0 == lt) {
    printf("Good.\n");
  } else {
    printf("What? t0 = %.17Le != %.17Le\n", t0, lt);
  }
}

Should output:
Good.
While in BashOnWindows, it output:
What? t0 = 7.46536864129530799e-4948 != 3.64519953188247460e-4951

@benhillis
Copy link
Member

Are you sure you're using the same version of g++ on WSL versus your Linux machine?

@therealkenc
Copy link
Collaborator

therealkenc commented Aug 9, 2016

Confirmed it happens with consistent g++ version (6.1.1 20160510). Same ELF64 object generates byte-for-byte on native and WSL. It doesn't even call any math library functions. Looking at the assembly, everything that calculates t0 is inline, and it only escapes to call denorm_min() and printf()/puts(), let alone leaves userspace for math. strace is the same on both sides. No idea.... 🤷

@aseering
Copy link
Contributor

aseering commented Aug 9, 2016

It's been a while since I looked at this sort of thing (pre-x86_64), but IIRC Windows and Linux have slightly different calling conventions regarding saving things like the FPU control registers during/after syscalls? Maybe relevant?

@stehufntdev
Copy link
Collaborator

stehufntdev commented Aug 9, 2016

Thanks for reporting the issue. Just looked at this, and it is a difference between the default floating point control register (i.e. FPCSR\NPXCW) between Linux and NT. glibc assumes 0x37f but NT uses 0x27f. To confirm, I added the following lines to the sample above to set the control register and it passed:

#include <fpu_control.h>
...
   unsigned short Cw = 0x37f;
  _FPU_SETCW(Cw);

MxCsr is the same between Linux and NT, but I'll poke around to see if there are other control registers that are mismatched. I'll file a bug on our side to match the default configuration on Linux.

@hrlngrv
Copy link

hrlngrv commented Aug 10, 2016

FTHOI, I compiled the sample program in Linux, then copied the resulting executable to WSL. It produced different results under Linux and WSL, 1.00000000000000000e-05 under Linux but 1.00000000000000008e-05 under WSL. It's not g++ that's the problem.

Does WSL not support 80-bit floating point? Does WSL not use the FPU but only provide 64-bit software floating point math? FWLIW, awk under Linux produces the same output as Windows.

@benhillis
Copy link
Member

See the comment above yours. There is a difference between the floating point control register on Windows vs Linux. Steve is working on a fix.

@bewantbe
Copy link
Author

@stehufntdev Thank you for the temporary solution (especially the magic number!!! :-)) and the on going work. With that fix, my (some other) computation executable now gives the same results under BashOnWindows as under Linux.

@therealkenc Thanks for the clarification, that's exactly what happend.

Just for completeness
Windows version:
10.0.14393
The test programs are produced by either
GCC 4.9, libc 2.19 (Debian 8.5 (jessie))
GCC 5.4, libc 2.23 (Debian testing (stretch))
with all libs statically linked.

@richfelker
Copy link

Bumping this because it's a serious ABI issue and completely breaks any musl libc-linked binaries that use floating point scanf/strtod, among other things. It probably also breaks printf rounding of floating point values. We use long double internally for various reasons, including issues related to excess precision (on archs that have FLT_EVAL_METHOD>0), and the cost is irrelevant since the bulk of the code actually works with integers, but of course it badly breaks when run with an ABI-non-conforming control word setting or an emulator that fails to emulate x87 correctly.

@benhillis
Copy link
Member

@richfelker - Thanks for pinging this thread, this issue is still on your backlog. If possible could you describe a specific scenario that is not working because of this to help us prioritize this fix?

@richfelker
Copy link

OK, I'll work out a specific strtod call that returns wrong result with x87 control word set incorrectly.

@fweimer
Copy link

fweimer commented May 5, 2017

@benhillis, I think several tests in the glibc test suite (e.g., math/test-ldouble, stdlib/tst-strtod-round) fail due to this issue.

@richfelker
Copy link

With the control word set wrong, strtod("4398046511105",0) returns 4398046511104. This represents the exact boundary I expected to break: values with more than 53-(64-53) = 42 bits in the significand.

@therealkenc
Copy link
Collaborator

I still think this is going to have a NodeJS repro if someone in the small intersection of coders that do numerical computation stuff and ECMAScript stuff (not me) wants to take a crack at it. The interpreter (JIT compiled or otherwise) is going to make hard assumptions about the floating point mode. Might be motivational.

@ctsa
Copy link

ctsa commented May 23, 2018

+1

We just tried to transfer one of our projects onto WSL (https://github.com/Illumina/strelka), and traced failing unit tests down to differences like this within boost::math::hypergeomtric_distribution<>:

#include <iostream>
#include <iomanip>

int main()
{
    long double n(7.59070505394721872907e+286);
    long double d(2.00440157654530257755e+291);

    // linux n = 3.78701810194652734355e-05
    // WSL  n = 3.78701810194652746001e-05
    n /= d;

    std::cout.precision(21);
    std::cout << "n = " << n << "\n";
}

@ghost
Copy link

ghost commented May 28, 2018

@benhillis, this is about to be a two years old issue. Can we have some updates? Something we can do to help you guys fixing it?

@ethanherbertson
Copy link

I just ran into this (or a similar) issue while performance testing two different ways to do a calculation in Node.js.

On WSL the calculation started producing an incorrect result after a few thousand iterations, whereas on a RHEL virtual machine it was always correct (at least through 1,000,000 iterations). The specific iteration where the first error occurred was consistent as long as the script was unchanged, but it would vary somewhat if small changes were introduced into the script.

Here is one such script:

var sqrtfive = Math.sqrt(5);
var phi = (1 + sqrtfive) / 2;
var fibc = function (n) { return Math.round((Math.pow(phi, n)) / sqrtfive); };

var output = 0;
var correct_output = fibc(75);
var num_errors = 0;
for (var x = 0; x < 100000; x++) {
    output = fibc(75);
    if (output != correct_output) {
        num_errors++;
        console.log('error at x = ' + x);
        if (num_errors > 25) {
            break;
        }
    }
}
console.log(correct_output);
console.log(output); 

Running the above through node v8.11.3 prints, on my machine:

error at x = 11644
error at x = 15526
error at x = 15527
error at x = 15528
error at x = 15529
error at x = 15530
error at x = 15531
error at x = 15532
error at x = 15533
error at x = 15534
error at x = 15535
error at x = 15536
error at x = 15537
error at x = 15538
error at x = 15539
error at x = 15540
error at x = 15541
error at x = 15542
error at x = 15543
error at x = 15544
error at x = 15545
error at x = 15546
error at x = 15547
error at x = 15548
error at x = 15549
error at x = 15550
2111485077978050
2111485077978055

(For the record, 2111485077978050 is the correct answer for Fib(75).)

@richfelker
Copy link

Ping? Back in October I had indication out-of-band that a fix for this was coming, but I haven't seen anything about it since.

@luoqi-git
Copy link

It looks like a pretty old ticket, is there any plan to resolve it? I image the fix is very straightforward, just initialize the right word in the process control block/context of the first linux process.

@luoqi-git
Copy link

luoqi-git commented Mar 13, 2019

Ok, this problem is more complicated (and more serious) than I thought -- the x87 control word is NOT inherited by a child process across fork(), which means there is no easy user space workaround by writing a wrapper program to properly initialize the control word then fork and exec. I believe this inheritance is required for POSIX compliance.

% cat t.c
#include <fenv.h>
#include <math.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>
#include <unistd.h>
#include <xmmintrin.h>

static unsigned short _mm_getfcw()
{
        unsigned short word;
        asm volatile("fstcw %0" : "=m" (*(&word)));
        return word;
}

static void _mm_setfcw(unsigned short word)
{
        asm volatile("fldcw %0" : : "m" (*(&word)));
}

__attribute__((constructor))
static void _mm_fpuinit()
{
        asm volatile("finit");
}

static void report(const char *name)
{
        printf("%s\tfcw %x csr %x\n", name, _mm_getfcw(), _mm_getcsr());
}

static jmp_buf state;

static void sigfpe(int sig, siginfo_t *info, void *uap)
{
        ucontext_t *uc = uap;
        fpregset_t fp = uc->uc_mcontext.fpregs;

        if (!fp) {
                printf("sigfpe\tno fpu context\n");
        } else {
                printf("sigfpe\tfcw %x csr %x\n", fp->cwd, fp->mxcsr);
        }
        longjmp(state, 1);
}

int main(int argc, char **argv)
{
        report("initial");

        fesetround(FE_TOWARDZERO);
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

        report("parent");

        {
                pthread_t thr;
                pthread_create(&thr, 0, (void *(*)(void*))report, "thread");
                pthread_join(thr, 0);
        }

        if (fork() == 0) {
                report("child");
                if (setjmp(state) == 0) {
                        long double x = 0;
                        struct sigaction sa;
                        memset(&sa, 0, sizeof sa);
                        sa.sa_sigaction = sigfpe;
                        sa.sa_flags = SA_SIGINFO;
                        sigaction(SIGFPE, &sa, 0);
                        feenableexcept(FE_DIVBYZERO);
                        x = 1 / x;
                        return (int)x;
                }
        } else {
                int status;
                wait(&status);
                if (WIFSIGNALED(status)) {
                        printf("child killed by signal %d\n",
                                WTERMSIG(status));
                } else if (WIFEXITED(status)) {
                        printf("child exited with status %d\n",
                                WEXITSTATUS(status));
                }
        }
        return 0;
}
% cc -o t t.c -lm -lpthread
native% ./t
initial	fcw 37f csr 1f80
parent	fcw f7f csr ff80
thread	fcw f7f csr ff80
child	fcw f7f csr ff80
sigfpe	fcw f7b csr fd80
child exited with status 0
wsl% ./t
initial	fcw 37f csr 1f80
parent	fcw f7f csr ff80
thread	fcw 27f csr 1f80
child	fcw 27f csr 1f80
child exited with status 142

It is also confirmed that the control word is NOT inherited by a child (POSIX) thread. And the problem is not limited to x87 control word, it equally affects the SSE control and status register.

Many manifestations of the same problem were reported earlier in this ticket, let me try to summarize here the full impact, based on what I have gathered so far:

  • long double, which gcc maps into 80-bit extended precision floating point on x86, won't work as expected
  • C99 fenv(3) settings won't be passed down to child processes or threads, changes to rounding mode or exception masks will have to be set in each individual thread an application creates
  • The same is true for the commonly used flush to zero mode to speed up the denormal handling in SIMD code

Update:
There also seems to be problem with SIGFPE generation, division by zero is expected to raise a SIGFPE signal, but instead the process is terminated with some arbitrary exit status. My original intention was to see what x87 control word value would be saved into a sigcontext, as a way to probe if/how the kernel tracks the register. This behavior is the same if SSE is used instead of x87 (just replace the long double with double in the test code).

@therealkenc
Copy link
Collaborator

therealkenc commented Mar 13, 2019

which means there is no easy user space workaround by writing a wrapper program to properly initialize the control word then fork and exec

LD_PRELOAD __libc_start_main(). Something like:

#include <dlfcn.h>
#include <fpu_control.h>

template<typename Tfn>
auto dlfn(const char *name)
{
    return reinterpret_cast<Tfn>(dlsym(RTLD_NEXT, name));
}

extern "C"
int __libc_start_main(int *(main) (int, char * *, char * *), int argc, char * * ubp_av, void (*init) (void), 
    void (*fini) (void), void (*rtld_fini) (void), void (* stack_end))
{
    auto start = dlfn<decltype(&::__libc_start_main)>("__libc_start_main");
    unsigned short cw = 0x37f;
    _FPU_SETCW(cw);
    return start(main, argc, ubp_av, init, fini, rtld_fini, stack_end);
}

Utterly untested, that is off the cuff.

I believe this inheritance is required for POSIX compliance.

I believe you made that up I believe I've been corrected. But it's still pretty disappointing the incompatibility hasn't been addressed yet.

@luoqi-git
Copy link

@therealkenc Yes, it is possible to initialize the control word in the main process through LD_PRELOAD, but it will not be passed down to any child process or more seriously any child thread, which renders it useless in most applications.

I didn't make it up about POSIX compliance, whereas there could be some latitude regarding execve(2), there is no ambiguity about fork(2), see the spec about fork, and I quote,

The fork() function shall create a new process. The new process (child process) shall be an exact copy of the calling process (parent process) except as detailed below:

Here "exact copy" is to be interpreted as all machine specific/dependent states -- I know this because I am an OS developer.

This means I also understand the fix of this problem is non-trivial. The symptom indicates the x87 control word is not maintained as part of the linux process context, it's only in the NT process context. What could further complicate this effort is that x87 coprocessor is usually implemented as lazy during context switch, that is, the FPU is initially marked as not present, any FPU instruction would trigger a fpu-unavailable trap at which time FPU context (including the control word) would be restored. It is very likely the current linux subsystem module won't even see this trap, and some changes to the NT microkernel proper may have to be made.

@richfelker
Copy link

I'm quite aware of that, but the "no decent OS developer" comment is rather exaggerated. These days many operating systems do not make this optimization for various reasons; it's nowhere near as valuable as it was a couple decades ago. In any case you could make the fix that way for now, and optimize later if it turns out the performance hit is a problem, rather than leaving it broken for the sake of performance.

FWIW the issue you cited is exactly why we're not going to workaround the WSL bug in musl; doing so would make all binaries generated with the workaround unconditionally poke the fpu when they shouldn't have to, and the cost would not go away when WSL gets fixed because it would be linked into the binaries rather than in OS-level code that will go away once it's fixed right.

@luoqi-git
Copy link

@richfelker Brother, I feel your pain. I am here not because I'm a kernel developer, but because an application my team developed was hit by this bug. I would very much like to see this problem fixed and am offering any help I can render. At this point I may be desperate enough to insert an finit instruction at the beginning of any thread our app creates.

@therealkenc
Copy link
Collaborator

therealkenc commented Mar 13, 2019

FWIW the issue you cited is exactly why we're not going to workaround the WSL bug in musl; doing so would make all binaries generated with the workaround unconditionally poke the fpu when they shouldn't have to

I suspect, but can't prove, a variation of this argument is being applied by WSL devs too. It could be fixed hacked worked-around in the WSL driver, but a "proper" fix is an upstream problem. So here were are.

Which is good. Everyone is in agreement that without a perfect fix all hope is lost, and hopefully this thread can go quiet again.

@richfelker
Copy link

@therealkenc, I don't see them as equivalent arguments at all. One is about a suboptimal fix at the level of a component that ships with OS updates and never gets linked into applications, so that, if it's costly, it can go away in a future OS update when a better fix is made. The other is about hard-coding a workaround into application binaries in a way that affects running them on systems not affected by the bug, including real Linux systems and the expected post-fix WSL systems at some point in the future.

@Po-wei
Copy link

Po-wei commented Mar 24, 2019

Although you want same output between native Linux and WSL, I think the double / long double do not have explicit size in C++ standard so it is just like a Linux distro run a difference hardware and therefore has different result.
Maybe you need a quick workaround and not a substantial change in kernel.
WSL has compatibility is better though.

@njsmith
Copy link

njsmith commented Mar 24, 2019

@Po-wei It's true that the C/C++ language standards don't specify the exact precision of long double. They leave that up to the platform ABI. And the standard ABI for Linux x86 and x86-64 says that long double has 80 bit precision, and that the x87 FPU control word has to be initialized the right way.

Other platforms are free to implement C/C++ with different ABIs, but WSL is trying to implement the Linux x86/x86-64 ABI, so that's the relevant standard. This is definitely a bug in WSL.

Note that several of the people posting here are well-known experts who are involved in making C/C++ work on Linux in the first place. They know what they're talking about :-)

@Po-wei
Copy link

Po-wei commented Mar 24, 2019

@njsmith Thanks for your explanation!

@bewantbe
Copy link
Author

bewantbe commented Apr 6, 2019

@Po-wei In some way, it did breaks C++ standard library. (Beside the break of C library demonstrated previously) Specifically, it breaks std::numeric_limits<long double> in header <limits>.

For example, in WSL, std::numeric_limits<long double>::epsilon() tells me (approximately) 1.0842e-19. However, if you write a loop to actually test machine epsilon of long double, the result is 2.22045e-16:

// g++ -std=c++11 test_ld.cpp && ./a.out
#include <iostream>
#include <limits>

int main()
{
  typedef std::numeric_limits<long double> TT;
  std::cout << "mantissa bits    : " << TT::digits << "\n";    // =64
  std::cout << "         epsilon : " << TT::epsilon() << "\n"; // =1.0842e-19

  long double t = 1.0;
  while (1 + t/2 != 1) {
    t = t/2;
  }
  std::cout << "measured epsilon : " << t << "\n"; // WSL: =2.22045e-16
}

So, even a standard conforming code/programmer (that didn't assumes accuracy of long double) could be fooled by std::numeric_limits<long double> which is supposed to be a standard way to extract numerical type information. The C++ standard didn't mention floating point control word.

@AgnerF
Copy link

AgnerF commented May 28, 2019

Why has this problem never been fixed? Why is it not mentioned in the WSL FAQ?
It has caused me a lot of trouble.

@aseering
Copy link
Contributor

aseering commented Jul 3, 2019

I assume this is one of the advantages of running a real Linux kernel under virtualization (which is what WSL2 does) rather than having this aspect of syscalls be handled by the NT kernel directly.

@richfelker
Copy link

Unless WSL1 is entirely abandoned in favor of WSL2, I think this still needs to be fixed. Maybe now that WSL2 is out there and performance-sensitive users are going to be switching to it, the simple fix for WSL1 will be less controversial...?

@richfelker
Copy link

Ping. People are still hitting this. See the new reference above and the post on the musl list: https://www.openwall.com/lists/musl/2019/09/25/16

@richfelker
Copy link

Ping. Is WSL1 still a thing, and if so, can this please be fixed?

@richfelker
Copy link

richfelker commented May 10, 2021

Ping. A user just hit this again today. So apparently people still are using WSL1. Being that this would literally be a one-line fix, is there any reason it can't be done??

@Gabrielcarvfer
Copy link

Yup, it is a shame it hasn't been fixed yet.

@ldeng-ustc
Copy link

This issue cause ns-3 int64x64 TEST Failed.
https://groups.google.com/g/ns-3-users/c/xfyQMrVisrc

@jlearman
Copy link

WSL2 fails on one of my laptops, so I use WSL1.

On another (newer) laptop, WSL2 "works" but I need to use VirtualBox, and I have had no end of difficulty trying to get the two to work together, so WSL2 is not an option on that computer either.

So, some of us need WSL1. Please fix it!

@Gabrielcarvfer
Copy link

@jlearman they didn't fix this in 8 years because the NT folks refuse to do it... Unless a big company or government asks for it, I doubt they will ever bother. If I may suggest, vagrant and multipass work with VirtualBox.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests