SEGV with debugging perls with multiplicity on #90

Open
andk opened this issue May 2, 2023 · 27 comments

Comments

@andk

andk commented May 2, 2023

Sample fail report: http://www.cpantesters.org/cpan/report/9dadce3e-e8fc-11ed-a654-b70f1145618a

With that same perl I produced a core file and then got this stack trace:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.36.1/29fb/bin/per'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fae1a749e03 in fortuna_start (prng=0x55a6156dd838) at ltc/prngs/fortuna.c:234
234        prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
(gdb) bt
#0  0x00007fae1a749e03 in fortuna_start (prng=0x55a6156dd838) at ltc/prngs/fortuna.c:234
#1  fortuna_start (prng=0x55a6156dd838) at ltc/prngs/fortuna.c:217
#2  0x00007fae1a6f3552 in XS_Crypt__PRNG_new (my_perl=0x55a614cb02a0, cv=<optimized out>) at ./inc/CryptX_PRNG.xs.inc:36
#3  0x000055a61402a834 in Perl_pp_entersub (my_perl=0x55a614cb02a0) at pp_hot.c:5353
#4  0x000055a613fdf0ea in Perl_runops_debug (my_perl=0x55a614cb02a0) at dump.c:2677
#5  0x000055a613f2d999 in S_run_body (oldscope=1, my_perl=0x55a614cb02a0) at perl.c:2721
#6  perl_run (my_perl=0x55a614cb02a0) at perl.c:2644
#7  0x000055a613eee46e in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:110

@karel-m
Contributor

karel-m commented May 4, 2023

@sjaeckel do you have any idea what might have gone wrong in int fortuna_start(prng_state *prng)?

The line of the segfault is https://github.com/DCIT/perl-CryptX/blob/master/src/ltc/prngs/fortuna.c#L234

@sjaeckel

sjaeckel commented May 5, 2023

The first thing that comes to my mind is that the allocated struct isn't big enough.

Could be because LTC_FORTUNA_POOLS is different in the two compile units ... but otherwise ...

How can this be reproduced?
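
To make the hypothesis above concrete: if two compile units see different values for a pool-count macro, they disagree on the size of the state struct, and the unit that assumes the larger layout writes past the end of what the other one allocated. The following is only a minimal sketch under that assumption. The struct, the pool counts (4 and 32) and all `demo_` names are hypothetical and do not reflect the real libtomcrypt layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a PRNG state whose array length comes from a
 * macro. This is NOT the real libtomcrypt struct. */
#define DEMO_STATE(name, pools)          \
    struct name {                        \
        unsigned char pool[pools][64];   \
        unsigned long pool_idx;          \
        unsigned long pool0_len;         \
    }

DEMO_STATE(state_as_seen_by_caller, 4);    /* caller compiled with 4 pools   */
DEMO_STATE(state_as_seen_by_library, 32);  /* library compiled with 32 pools */

/* Bytes the library would write beyond the caller's allocation when it
 * stores to its last field. */
size_t demo_overrun_bytes(void)
{
    size_t lib_end = offsetof(struct state_as_seen_by_library, pool0_len)
                   + sizeof(unsigned long);
    size_t caller_size = sizeof(struct state_as_seen_by_caller);
    return lib_end > caller_size ? lib_end - caller_size : 0;
}

/* Prints both views of the struct size for manual inspection. */
void demo_report(void)
{
    assert(sizeof(struct state_as_seen_by_library) >
           sizeof(struct state_as_seen_by_caller));
    printf("caller sizeof  = %zu\n", sizeof(struct state_as_seen_by_caller));
    printf("library sizeof = %zu\n", sizeof(struct state_as_seen_by_library));
    printf("overrun bytes  = %zu\n", demo_overrun_bytes());
}
```

Calling `demo_report()` prints the two sizes side by side. In a real build the equivalent check would be to compare `sizeof(prng_state)` as seen from the XS code and from inside the library.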

@andk
Author

andk commented Aug 23, 2023

A fresh report with a more recent perl (5.38.0) that exposes the problem: http://www.cpantesters.org/cpan/report/d4da173a-1f77-11ee-a370-d61eba172296

Not every perl with a similar configuration exposes the problem. But it seems that once you have a build that exhibits it, it is reproducible. I just let the perl from the report above run the t/prng_fortuna.t test ~1000 times, and the SEGV happened every time.

The stack trace for this perl looks practically the same as above:

Reading symbols from /home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.38.0/29fb/bin/perl...
[New LWP 2944018]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/sand/src/perl/repoperls/installed-perls/host/k93msid/v5.38.0/29fb/bin/per'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f778374b0f3 in fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:234
234        prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
(gdb) bt
#0  0x00007f778374b0f3 in fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:234
#1  fortuna_start (prng=0x556d8db50358) at ltc/prngs/fortuna.c:217
#2  0x00007f7783717bed in XS_Crypt__PRNG_new (my_perl=0x556d8d1ba2a0, cv=<optimized out>) at ./inc/CryptX_PRNG.xs.inc:36
#3  0x0000556d8bdba514 in Perl_pp_entersub (my_perl=0x556d8d1ba2a0) at pp_hot.c:5555
#4  0x0000556d8bd6937a in Perl_runops_debug (my_perl=0x556d8d1ba2a0) at dump.c:2861
#5  0x0000556d8bc7d8b8 in S_run_body (oldscope=1, my_perl=0x556d8d1ba2a0) at perl.c:2812
#6  perl_run (my_perl=0x556d8d1ba2a0) at perl.c:2727
#7  0x0000556d8bc43475 in main (argc=<optimized out>, argv=<optimized out>, env=<optimized out>) at perlmain.c:127

@sjaeckel

How can I reproduce this locally? Can I somehow get access to this exact version that fails?

I tried it locally with the latest version and

$ perl --version

This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-linux-thread-multi
[...]
$ make test
[...]
All tests successful.
Files=137, Tests=39024, 18 wallclock secs ( 1.20 usr  0.26 sys + 16.01 cusr  1.31 csys = 18.78 CPU)
Result: PASS

@karel-m
Contributor

karel-m commented Oct 2, 2023

It is not easy to reproduce. I tried to build a perl-5.36.1 binary on Ubuntu 22.04 with the same options as in the original failing report:

./Configure \
    -Dprefix=/home/miko/myperl-out \
    -Dmyhostname=myhost \
    -Dinstallusrbinperl=n \
    -Uversiononly \
    -Dusedevel \
    -Ui_db \
    -Dlibswanted='cl pthread socket inet nsl gdbm dbm malloc dl ld sun m crypt sec util c cposix posix ucb BSD gdbm_compat' \
    -Duseithreads \
    -Uuselongdouble \
    -DEBUGGING=both \
    -des

But I was unable to reproduce the failure in t/prng_fortuna.t test.

@Leont

Leont commented Oct 17, 2023

I have been able to reproduce it. The problem is in these innocent-looking lines:

   prng->u.fortuna.pool_idx = prng->u.fortuna.pool0_len = 0;
   prng->u.fortuna.reset_cnt = prng->u.fortuna.wd = 0;

Somehow those can result in a null-pointer dereference. I don't understand what's going on here either: it only happens with -O2; with -O0 it runs fine. Is this a compiler bug, or are we missing something obvious that's undefined in C?

I worked around it by removing those two lines and using this instead (before initializing the pools):

memset(&prng->u.fortuna, '\0', sizeof(struct fortuna_prng));

Obviously, this is not a very satisfying fix.
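
For reference, the two initialisation styles discussed above are expected to be equivalent in standard C, which is part of why the crash is puzzling. The sketch below compares them on a much-reduced, hypothetical stand-in struct (`demo_fortuna` and the `demo_` functions are illustrative names only, not libtomcrypt code). It does not explain the SEGV; it only shows what the workaround replaces.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for the Fortuna state. */
struct demo_fortuna {
    unsigned long pool_idx;
    unsigned long pool0_len;
    unsigned long wd;
    unsigned long reset_cnt;
};

/* Original style: chained assignments to individual fields. */
void demo_reset_assign(struct demo_fortuna *s)
{
    assert(s != NULL);
    s->pool_idx = s->pool0_len = 0;
    s->reset_cnt = s->wd = 0;
}

/* Workaround style: clear the whole struct in one call, before any
 * other field is initialised. */
void demo_reset_memset(struct demo_fortuna *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
}

/* Returns 1 when both styles leave the state byte-for-byte identical. */
int demo_resets_agree(void)
{
    struct demo_fortuna a, b;
    memset(&a, 0xAA, sizeof(a));   /* start from non-zero garbage */
    memset(&b, 0xAA, sizeof(b));
    demo_reset_assign(&a);
    demo_reset_memset(&b);
    return memcmp(&a, &b, sizeof(a)) == 0;
}
```

Note that in the real code the `memset` has to run before the pools are initialised, as stated above, because it clears the entire Fortuna state and not just the four counters.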

@karel-m
Contributor

karel-m commented Oct 18, 2023

@sjaeckel ^^^

@sjaeckel

@karel-m I'm already watching this issue :)

I have been able to reproduce it.

@Leont How?

I worked around it by [...]

memset(&prng->u.fortuna, '\0', sizeof(struct fortuna_prng));

TBH I would prefer to leave the fortuna code as it is and wait for the moment when someone solves the underlying problem, since that can't be the real solution. Or am I mistaken here?

@karel-m
Contributor

karel-m commented Oct 18, 2023

Just for completeness here is a code fragment from my perl xs/c module, something may be wrong here:

typedef struct prng_struct {            /* used by Crypt::PRNG */
  prng_state state;
  struct ltc_prng_descriptor *desc;
  IV last_pid;
} *Crypt__PRNG;

/* ================================ */

        Newz(0, RETVAL, 1, struct prng_struct);                        // memory allocation of prng_struct
        if (!RETVAL) croak("FATAL: Newz failed");

        id = cryptx_internal_find_prng(prng_name);
        if (id == -1) {
          Safefree(RETVAL);
          croak("FATAL: find_prng failed for '%s'", prng_name);
        }
        RETVAL->last_pid = curpid;
        RETVAL->desc = &prng_descriptor[id];

        rv = RETVAL->desc->start(&RETVAL->state);                     // the crash
        if (rv != CRYPT_OK) {
          Safefree(RETVAL);
          croak("FATAL: PRNG_start failed: %s", error_to_string(rv));
        }

@karel-m
Contributor

karel-m commented Oct 18, 2023

It is also worth mentioning that the same code works without a crash for Crypt::PRNG::ChaCha20 / Crypt::PRNG::RC4 / Crypt::PRNG::Sober128 / Crypt::PRNG::Yarrow; the only difference is the id returned by cryptx_internal_find_prng(). That supports the idea that the problem is Fortuna-specific.
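
The allocate, look up, start pattern from the XS fragment can be sketched in plain C as follows. This is only an illustrative analogue under simplifying assumptions: `calloc` stands in for `Newz`, and the descriptor table, the algorithm names "alpha" and "beta", and all `demo_` identifiers are hypothetical. It shows why only the table index, and therefore only the `start` function behind it, differs between the PRNGs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical state shared by all demo algorithms. */
typedef struct { unsigned long counter; } demo_state;

/* One descriptor per algorithm, each with its own start function. */
typedef struct {
    const char *name;
    int (*start)(demo_state *st);
} demo_descriptor;

static int demo_start_a(demo_state *st) { st->counter = 1; return 0; }
static int demo_start_b(demo_state *st) { st->counter = 2; return 0; }

static const demo_descriptor demo_table[] = {
    { "alpha", demo_start_a },
    { "beta",  demo_start_b },
};

/* Returns the table index for name, or -1 if the name is unknown. */
int demo_find(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
        if (strcmp(demo_table[i].name, name) == 0) return (int)i;
    }
    return -1;
}

/* Allocates zeroed state, looks up the descriptor, then dispatches to
 * its start function. Returns NULL on any failure. */
demo_state *demo_new(const char *name)
{
    int id;
    demo_state *st = calloc(1, sizeof(*st));
    if (st == NULL) return NULL;
    id = demo_find(name);
    if (id == -1 || demo_table[id].start(st) != 0) {
        free(st);
        return NULL;
    }
    return st;
}
```

A caller would use `demo_new("beta")` and release the result with `free()`.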

@sjaeckel

perl-CryptX/CryptX.xs

Lines 121 to 125 in fc61205

typedef struct prng_struct {            /* used by Crypt::PRNG */
  prng_state state;
  struct ltc_prng_descriptor *desc;
  IV last_pid;
} *Crypt__PRNG;

        Newz(0, RETVAL, 1, struct prng_struct);
        if (!RETVAL) croak("FATAL: Newz failed");
        id = cryptx_internal_find_prng(prng_name);
        if (id == -1) {
          Safefree(RETVAL);
          croak("FATAL: find_prng failed for '%s'", prng_name);
        }
        RETVAL->last_pid = curpid;
        RETVAL->desc = &prng_descriptor[id];
        rv = RETVAL->desc->start(&RETVAL->state);
        if (rv != CRYPT_OK) {
          Safefree(RETVAL);
          croak("FATAL: PRNG_start failed: %s", error_to_string(rv));
        }

IMO that code looks fine.

As pointed out by @Leont the crash also doesn't happen on the call of start() but inside the function, which looks even stranger. I guess there's no way to find out what really goes wrong without having a reproducer of the crash and investigating in depth.

@Leont

Leont commented Oct 18, 2023

@Leont How?

I suspect the issue only occurs on debugging perls. I don't fully understand that, because AFAICT that shouldn't affect the crypto code at all.

@sjaeckel

sjaeckel commented Oct 18, 2023

only occurs on debugging perls

That doesn't matter, it shouldn't happen. Please write down how it can be reproduced :)

@sjaeckel

While looking through the Perl internals regarding memory management ... Could this issue be related to mixing native and Perl-specific malloc/free calls? Using native malloc to allocate memory but Perl-free to free it or vice versa?

Have you ever thought of using the Perl-specific malloc/free calls inside ltc/ltm instead of the native ones? As the macro magic involved is quite extensive until you arrive at the Perl MM function that is actually called, I guess the easiest approach would be to trampoline those inside cryptx ...

void* cryptx_malloc(size_t sz)
{
   void *mem;
   Newz(0, mem, sz, char);   /* zero-filled allocation of sz bytes via Perl's allocator */
   return mem;
}
void cryptx_free(void *mem)
{
   Safefree(mem);
}
/* etc. */

Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.

Or do you already do that and I missed it while searching through the sources? :)
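
The trampoline idea can be tried out without Perl at all. The sketch below is a hedged illustration only: it routes the `XMALLOC`/`XFREE` macros to wrapper functions that use the C library allocator and count live allocations. All `demo_` names are hypothetical, and in a real build the macros would be set on the compiler command line rather than in the source.

```c
#include <assert.h>
#include <stdlib.h>

static size_t demo_live_allocations = 0;

/* Trampoline: every library allocation passes through here. */
void *demo_malloc(size_t sz)
{
    void *mem = calloc(1, sz);
    if (mem != NULL) demo_live_allocations++;
    return mem;
}

/* Trampoline: every library free passes through here. */
void demo_free(void *mem)
{
    if (mem != NULL) {
        assert(demo_live_allocations > 0);
        demo_live_allocations--;
    }
    free(mem);
}

/* Normally supplied by the build, e.g.
 * -DXMALLOC=demo_malloc -DXFREE=demo_free */
#ifndef XMALLOC
#define XMALLOC demo_malloc
#endif
#ifndef XFREE
#define XFREE demo_free
#endif

/* Stand-in for library code that only ever uses the macros. */
unsigned char *demo_lib_make_buffer(size_t len) { return XMALLOC(len); }
void demo_lib_drop_buffer(unsigned char *buf)   { XFREE(buf); }

/* Number of buffers allocated through the trampoline and not yet freed. */
size_t demo_live(void) { return demo_live_allocations; }
```

Because both allocation and release go through the same pair of wrappers, memory obtained from one allocator can no longer be handed to the other allocator's free function, which is the mismatch the comment above is worried about.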

@Leont

Leont commented Oct 19, 2023

Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.

That would be -DXMALLOC=PerlMem_malloc -DXFREE=PerlMem_free etc…

But I don't think that's what's going on here.

@sjaeckel

Then pre-define XMALLOC etc. while compiling ltc -DXMALLOC=cryptx_malloc.

That would be -DXMALLOC=PerlMem_malloc -DXFREE=PerlMem_free etc…

https://github.com/Perl/perl5/blob/dd4eb78c55aab441aec1639b1dd49f88bd960831/perl.h#L1697-L1739

You're sure?

But I don't think that's what's going on here.

If nobody reveals how it can be reproduced I'm pretty sure we will never find out.

@Leont

Leont commented Oct 22, 2023

Please write down how it can be reproduced :)

If using perlbrew, compile a perl with «perlbrew install perl-5.38.0 --debug --thread», and install the distribution on that perl.

@sjaeckel

sjaeckel commented Oct 23, 2023

If using perlbrew, compile a perl with «perlbrew install perl-5.38.0 --debug --thread», and install the distribution on that perl.

perl_V.txt

perl_V2.txt

It still doesn't fail on any of my machines... and with those two (slightly) different build configurations.

After looking through some of the failed builds on https://www.cpantesters.org/distro/C/CryptX.html I saw that all of the segfaults were on a machine called k93msid ... maybe there's something wrong on that box? Would it be possible to get SSH access to that machine?

@sjaeckel

@Leont you've been able to reproduce the issue on a machine that you have access to?

@Leont

Leont commented Oct 27, 2023

you've been able to reproduce the issue on a machine that you have access to?

Yes, I can reliably reproduce it on my computer.

@sjaeckel

sjaeckel commented Oct 30, 2023

Can you maybe tell me all the details of the tools you're using in the process? Which distro and compiler versions are you using? Can you please write down the exact commands you use to run all the tools (perlbrew etc.)? Or could you maybe even create a Docker image, based on your distro, to reproduce this?

Or do you see another way how we can debug this?

@Leont

Leont commented Aug 30, 2024

karel-m added the should be fixed in libtomcrypt label

I can confirm that I cannot reproduce the issue with CryptX 0.080_006.

@sjaeckel

@karel-m what exactly does that label mean? At first I thought an issue tagged with this label "is fixed in ltc". On second thought, is it instead "depends on ltc to be fixed"?
IIUC @Leont understood the former!? I'd now say it's the latter, because we didn't change anything relevant in ltc :)

@karel-m
Contributor

karel-m commented Aug 31, 2024

@sjaeckel the label indicates that the issue requires a fix in the libtomcrypt sources (at least, that's my opinion, which you might not share :). Maybe I should rename it to "needs a fix in libtomcrypt."

FYI CryptX 0.080_006 = libtomcrypt current develop branch 12bf723b which includes many changes since CryptX 0.080. Interestingly, there were basically no changes to the Fortuna code, so I have no idea why the above reported issue seems to have disappeared.

@sjaeckel

at least, that's my opinion, which you might not share :).

I'm sharing your opinion and I doubt that the underlying issue is fixed.

@Leont which CPU model does the computer you were seeing this on have?

Maybe I should rename it to "needs a fix in libtomcrypt."

👍

@Leont

Leont commented Aug 31, 2024

@Leont which CPU model does the computer you were seeing this on have?

AMD Ryzen 5 3600 6-Core Processor.
gcc version 14.1.1

@sjaeckel

OK, that CPU has AES-NI support.

... and is AES-NI even enabled? Never mind. I was just thinking aloud, and I still don't get where the problem could originate...
