Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIG_PENDING_DIE_COUNT kills program in new situation #14233

Open
p5pRT opened this issue Nov 12, 2014 · 24 comments
Open

SIG_PENDING_DIE_COUNT kills program in new situation #14233

p5pRT opened this issue Nov 12, 2014 · 24 comments

Comments

@p5pRT
Copy link

@p5pRT p5pRT commented Nov 12, 2014

Migrated from rt.perl.org#123188 (status was 'open')

Searchable as RT123188$

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 12, 2014

From d.white@imperial.ac.uk

Created by d.white@imperial.ac.uk

This is a bug report for perl from d.white@​imperial.ac.uk,
generated with the help of perlbug 1.39 running under perl 5.18.2.

-----------------------------------------------------------------
As in rejected bug #112404, we have just encountered the
SIG_PENDING_DIE_COUNT limit. On Perl 5.18.2 on Ubuntu 14.04 linux,
we have some locally developed highly threaded Perl code that
controls a set of 40-80 firewalled desktop PCs in "exam mode".
The controller manages a set of ssh connections to the client desktop
PCs, telling them when to start into exam mode, then every N minutes
telling them to take a copy of the local exam directory where the
student is working, then at the end tells the PCs to exit exam mode
and reboot. It uses a pool of worker threads, various "​: shared"
variables, and a Thread​::Queue of work-jobs (names of hosts to send
a command to). It does no explicit signal handling.

This code has worked reliably across multiple Perl versions on
multiple Ubuntu distros for approx 6 years, but on Ubuntu 14.04,
almost immediately we hit the dreaded

"Maximal count of pending signals (120) exceeded"

inside Thread​::Queue line 70. This is in the dequeue()
function where it appears to be using signals to build condition
variables, waiting and signalling on one.

We have rebuilt our own Perl 5.18.2 experimental version with

# define SIG_PENDING_DIE_COUNT 520

and invested approximately 8 hours of time in experiments on
80 desktop PCs, and found that this completely resolves our issue,
but this is a horrid kludge, given that we GUESSED the value 520​:-)

May I request that this code, the rationale for which was described in​:

www.nntp.perl.org/group/perl.perl5.porters/2006/12/msg119236.html

[in summary​: added in Dec 2008 to Perl 5.8.8 to fix a specific bug on OS/2,
with an arbitrary limit of 120 pending signals]

makes no sense now and should be altered, removed, or reimplemented, because
it is actively breaking real code.

One possible minimally intrusive solution would be to turn the limit from
a constant to a variable and add a command line option to perl to allow us
to set the limit to an arbitrary value. We would be willing to write
and submit such a patch if this would help.

Best Wishes
Duncan White
Systems Manager,
Computing Support Group,
Dept of Computing,
Imperial College London.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.18.2:

Configured by Debian Project at Thu Mar 27 18:28:21 UTC 2014.

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=3.2.0-58-generic, archname=x86_64-linux-gnu-thread-multi
    uname='linux brownie 3.2.0-58-generic #88-ubuntu smp tue dec 3 17:37:58 utc 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Dldflags= -Wl,-Bsymbolic-functions -Wl,-z,relro -Dlddlflags=-shared -Wl,-Bsymbolic-functions -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.18 -Darchlib=/usr/lib/perl/5.18 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.18.2 -Dsitearch=/usr/local/lib/perl/5.18.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.18.2 -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.8.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=, so=so, useshrplib=true, libperl=libperl.so.5.18.2
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
    DEBPKG:fixes/respect_umask - Respect umask during installation
    DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories
    DEBPKG:debian/extutils_set_libperl_path - EU:MM: Set location of libperl.a to /usr/lib
    DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
    DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
    DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
    DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option
    DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
    DEBPKG:debian/cpanplus_definstalldirs - http://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default.
    DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl.
    DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/702096 Point users to Debian packages of deprecated core modules
    DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
    DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
    DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.18.2-2ubuntu1 in patchlevel.h
    DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
    DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
    DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text
    DEBPKG:debian/hurd_test_skip_stack - http://bugs.debian.org/650175 Disable failing GNU/Hurd tests dist/threads/t/stack.t
    DEBPKG:fixes/manpage_name_Test-Harness - http://bugs.debian.org/650451 [rt.cpan.org #73399] cpan/Test-Harness: add NAME headings in modules with POD
    DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/660195 [rt.cpan.org #28632] Make EU::MM pass LD through to recursive Makefile.PL invocations
    DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
    DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable
    DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected
    DEBPKG:fixes/net_ftp_failed_command - [rt.cpan.org #37700] http://bugs.debian.org/491062 Net::FTP: cope gracefully with a failed command
    DEBPKG:fixes/perlbug-patchlist - [3541c11] http://bugs.debian.org/710842 [perl #118433] Make perlbug look up the list of local patches at run time
    DEBPKG:fixes/module_metadata_security_doc - [68cdd4b] CVE-2013-1437 documentation fix
    DEBPKG:fixes/module_metadata_taint_fix - [bff978f] http://bugs.debian.org/722210 [rt.cpan.org #88576] untaint version, if needed, in Module::Metadata
    DEBPKG:fixes/IPC-SysV-spelling - http://bugs.debian.org/730558 [rt.cpan.org #86736] Fix spelling of IPC_CREAT in IPC-SysV documentation
    DEBPKG:fixes/fix-undef-source -


@INC for perl 5.18.2:
    /homes/dcw/perl5/lib/perl5/x86_64-linux-gnu-thread-multi
    /homes/dcw/perl5/lib/perl5/x86_64-linux-gnu-thread-multi
    /homes/dcw/perl5/lib/perl5
    /etc/perl
    /usr/local/lib/perl/5.18.2
    /usr/local/share/perl/5.18.2
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.18
    /usr/share/perl/5.18
    /usr/local/lib/site_perl
    .


Environment for perl 5.18.2:
    HOME=/homes/dcw
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/lib:/usr/openwin/lib:/usr/dt/lib:/opt/lib:/usr/lib/x86_64-linux-gnu:/usr/local/pgsql/lib:/opt/gurobi562/linux64/lib:/usr/local/pgsql/lib:/opt/gurobi562/linux64/lib
    LOGDIR (unset)
    PATH=/homes/dcw/perl5/bin:/homes/dcw/c-tools/bin:/homes/dcw/c-tools/bin/x86_64:/usr/lib/ccache:/usr/local/condor/bin:/homes/dcw/bin:/homes/dcw/bin/x86_64:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/sbin:/usr/sbin:/usr/games:.:/opt/gurobi562/linux64/bin:/homes/dcw/.rvm/bin
    PERL5LIB=/homes/dcw/perl5/lib/perl5/x86_64-linux-gnu-thread-multi:/homes/dcw/perl5/lib/perl5
    PERL_BADLANG (unset)
    PERL_LOCAL_LIB_ROOT=/homes/dcw/perl5
    PERL_MB_OPT=--install_base /homes/dcw/perl5
    PERL_MM_OPT=INSTALL_BASE=/homes/dcw/perl5
    SHELL=/bin/tcsh

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 13, 2014

From @jkeenan

On Wed Nov 12 07​:09​:58 2014, d.white@​imperial.ac.uk wrote​:

This is a bug report for perl from d.white@​imperial.ac.uk,
generated with the help of perlbug 1.39 running under perl 5.18.2.

-----------------------------------------------------------------
As in rejected bug #112404, we have just encountered the
SIG_PENDING_DIE_COUNT limit. On Perl 5.18.2 on Ubuntu 14.04 linux,
we have some locally developed highly threaded Perl code that
controls a set of 40-80 firewalled desktop PCs in "exam mode".
The controller manages a set of ssh connections to the client desktop
PCs, telling them when to start into exam mode, then every N minutes
telling them to take a copy of the local exam directory where the
student is working, then at the end tells the PCs to exit exam mode
and reboot. It uses a pool of worker threads, various "​: shared"
variables, and a Thread​::Queue of work-jobs (names of hosts to send
a command to). It does no explicit signal handling.

This code has worked reliably across multiple Perl versions on
multiple Ubuntu distros for approx 6 years, but on Ubuntu 14.04,
almost immediately we hit the dreaded

"Maximal count of pending signals (120) exceeded"

inside Thread​::Queue line 70. This is in the dequeue()
function where it appears to be using signals to build condition
variables, waiting and signalling on one.

We have rebuilt our own Perl 5.18.2 experimental version with

# define SIG_PENDING_DIE_COUNT 520

and invested approximately 8 hours of time in experiments on
80 desktop PCs, and found that this completely resolves our issue,
but this is a horrid kludge, given that we GUESSED the value 520​:-)

May I request that this code, the rationale for which was described
in​:

www.nntp.perl.org/group/perl.perl5.porters/2006/12/msg119236.html

[in summary​: added in Dec 2008 to Perl 5.8.8 to fix a specific bug on
OS/2,
with an arbitrary limit of 120 pending signals]

makes no sense now and should be altered, removed, or reimplemented,
because
it is actively breaking real code.

One possible minimally intrusive solution would be to turn the limit
from
a constant to a variable and add a command line option to perl to
allow us
to set the limit to an arbitrary value. We would be willing to write
and submit such a patch if this would help.

Best Wishes
Duncan White
Systems Manager,
Computing Support Group,
Dept of Computing,
Imperial College London.

1. Can you suggest any way to reproduce this problem outside of your customized environment?

2. Have you ruled out the possibility that the problem lies in the version of Ubuntu you are using (which I also use) as distinct from Perl? If so, how have you made that rule-out?

3. Is the 'perl -V' output attached to your bug report from the machine where you observed the problems? If not, can you provide that output?

Thank you very much.
Jim Keenan

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 13, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 13, 2014

From @Leont

On Wed, Nov 12, 2014 at 4​:09 PM, via RT <perlbug-followup@​perl.org> wrote​:

As in rejected bug #112404, we have just encountered the
SIG_PENDING_DIE_COUNT limit. On Perl 5.18.2 on Ubuntu 14.04 linux,
we have some locally developed highly threaded Perl code that
controls a set of 40-80 firewalled desktop PCs in "exam mode".
The controller manages a set of ssh connections to the client desktop
PCs, telling them when to start into exam mode, then every N minutes
telling them to take a copy of the local exam directory where the
student is working, then at the end tells the PCs to exit exam mode
and reboot. It uses a pool of worker threads, various "​: shared"
variables, and a Thread​::Queue of work-jobs (names of hosts to send
a command to). It does no explicit signal handling.

This code has worked reliably across multiple Perl versions on
multiple Ubuntu distros for approx 6 years, but on Ubuntu 14.04,
almost immediately we hit the dreaded

"Maximal count of pending signals (120) exceeded"

inside Thread​::Queue line 70. This is in the dequeue()
function where it appears to be using signals to build condition
variables, waiting and signalling on one.

Thread​::Queue doesn't use signals. It would be helpful to strace your
program, to figure out what signal you're getting and where it comes from
(I should note, this can only happen for signals that you set a signal
handler for but have not blocked).

We have rebuilt our own Perl 5.18.2 experimental version with

# define SIG_PENDING_DIE_COUNT 520

and invested approximately 8 hours of time in experiments on
80 desktop PCs, and found that this completely resolves our issue,
but this is a horrid kludge, given that we GUESSED the value 520​:-)

May I request that this code, the rationale for which was described in​:

It's just as arbitrary as the previous value. Odds are it would trigger
again under a higher load. I'd rather understand the issue before paving
over it.

makes no sense now and should be altered, removed, or reimplemented,
because
it is actively breaking real code.

Probably.

One possible minimally intrusive solution would be to turn the limit from
a constant to a variable and add a command line option to perl to allow us
to set the limit to an arbitrary value. We would be willing to write
and submit such a patch if this would help.

There is some precedent for such things (such as the PERL_SIGNALS
environmental variable).

Leon

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 13, 2014

From d.white@imperial.ac.uk

Dear James,

On 13/11/14 01​:41, James E Keenan via RT wrote​:

On Wed Nov 12 07​:09​:58 2014, d.white@​imperial.ac.uk wrote​:

This is a bug report for perl from d.white@​imperial.ac.uk,
generated with the help of perlbug 1.39 running under perl 5.18.2.

-----------------------------------------------------------------
As in rejected bug #112404, we have just encountered the
SIG_PENDING_DIE_COUNT limit. On Perl 5.18.2 on Ubuntu 14.04 linux,
we have some locally developed highly threaded Perl code that..

1. Can you suggest any way to reproduce this problem outside of your customized environment?

I was thinking about that, I might be able to whip up a test version
that only does the thread and thread queue work, by extracting a minimal
part of the existing system (and none of the client code), and then
see whether that suffers the problem too. Whether ssh can be eliminated
is unclear; I'd love to of course! Leave this with me over the weekend
to think further on.

2. Have you ruled out the possibility that the problem lies in the version of Ubuntu
you are using (which I also use) as distinct from Perl? If so, how have you made that rule-out?

That is, of course, totally possible at several levels.

- Perl 5.18.2 as packaged by Ubuntu has many Ubuntu-specific patches.
  We could build a vanilla perl.org Perl 5.18.2 and try that (with
  the unchanged limit of 120 pending signals) if you like.

- Underneath Perl itself, gcc could compile Perl slightly differently
  and change something subtle wrt speed, or the linux kernel could
  handle signals slightly differently, deliver them slightly faster.
  There's really nothing we can do to eliminate such possibilities, we
  are an academic department in mid-term and academic year, we can't
  change C compiler, linux kernel and/or OS as a test.

I should mention that all the PCs in the tests (the controller and
it's 79 clients themselves), are unchanged hardware from last year
(when we did not experience the problem with Perl 5.14 on Ubuntu
13.04). So at least we can eliminate new faster hardware delivering
signals faster and tripping the limit​:-)

I would still argue that the presence of an arbitrary 8 year old limit
of 120 pending signals is unprincipled (as the original patch submitter
said back in 2008!) and is therefore a bad thing.

3. Is the 'perl -V' output attached to your bug report from the machine
where you observed the problems? If not, can you provide that output?

Yes, it is from the machine where we observed the problem. I checked
that carefully​:-) We have also experienced the problem running the
controller software on several identically configured PCs in our lab.

Thank you very much.
Jim Keenan

Thanks for getting in touch so quickly.

cheers
Duncan White

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 14, 2014

From @craigberry

On Thu, Nov 13, 2014 at 12​:18 PM, Duncan White <d.white@​imperial.ac.uk> wrote​:

I would still argue that the presence of an arbitrary 8 year old limit
of 120 pending signals is unprincipled (as the original patch submitter
said back in 2008!) and is therefore a bad thing.

Yep​:

http​://perl5.git.perl.org/perl.git/commit/2563cec55ae473562ff3ccda41cd10289db419be?f=mg.c

I agree it's not nice to have these guessed-at, hard-coded numbers,
but presumably the same limit was in previous Perl versions that
worked for you, so there must be more to the story. As tempting as it
is to simply remove the limit, there is no reason to believe the
original reason for its existence has gone away. A resource
exhaustion that leads to a deadlock is a lot nastier than a program
dying, even if the limit chosen to prevent that on OS/2 is quite
inappropriate for other systems. It's also probably quite difficult
to guess at a better value that is portable.

FWIW, in general, newer Perls should have fewer pending signals
because PERL_ASYNC_CHECK() is called in places it wasn't earlier
(sorry, haven't researched the exact timeline).

A quick look at threads​::shared (used by Thread​::Queue) indicates that
cond_signal will be using either condition_signal or
pthread_cond_signal under the hood. Which makes me skeptical whether
what it's doing would even be incrementing PL_sig_pending and
contributing to the limit being hit.

So I think the real question here is what causes a lot more signals to
be pending than was the case when this worked. If there is a
particular Perl op taking a lot longer than it used to, that would be
one way because the PL_sig_pending count increments until the op
completes and the pending signals are delivered.

If you are stuck building your own Perl from source for now, you
should be able to get what you need by configuring like so​:

./Configure -Accflags=-DSIG_PENDING_DIE_COUNT=N

where N is 520 or whatever number works for you.

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 17, 2014

From d.white@imperial.ac.uk

Dear all,

On 14/11/14 00​:51, Craig Berry via RT wrote​:

On Thu, Nov 13, 2014 at 12​:18 PM, Duncan White <d.white@​imperial.ac.uk> wrote​:

I would still argue that the presence of an arbitrary 8 year old limit
of 120 pending signals is unprincipled (as the original patch submitter
said back in 2008!) and is therefore a bad thing.

Yep​:

http​://perl5.git.perl.org/perl.git/commit/2563cec55ae473562ff3ccda41cd10289db419be?f=mg.c

I agree it's not nice to have these guessed-at, hard-coded numbers,
but presumably the same limit was in previous Perl versions that
worked for you, so there must be more to the story. As tempting as it
is to simply remove the limit, there is no reason to believe the
original reason for its existence has gone away. A resource
exhaustion that leads to a deadlock is a lot nastier than a program
dying, even if the limit chosen to prevent that on OS/2 is quite
inappropriate for other systems. It's also probably quite difficult
to guess at a better value that is portable.

Understood.

FWIW, in general, newer Perls should have fewer pending signals
because PERL_ASYNC_CHECK() is called in places it wasn't earlier
(sorry, haven't researched the exact timeline).

A quick look at threads​::shared (used by Thread​::Queue) indicates that
cond_signal will be using either condition_signal or
pthread_cond_signal under the hood. Which makes me skeptical whether
what it's doing would even be incrementing PL_sig_pending and
contributing to the limit being hit.

So I think the real question here is what causes a lot more signals to
be pending than was the case when this worked. If there is a
particular Perl op taking a lot longer than it used to, that would be
one way because the PL_sig_pending count increments until the op
completes and the pending signals are delivered.

I agree we need to investigate this further. We are working on a
cutdown set of programs that we can share with you guys so that you can
try to reproduce the problem, and we are also investigating vanilla
perl vs ubuntu perl (to try to eliminate at least 1 degree of freedom).

If you are stuck building your own Perl from source for now, you
should be able to get what you need by configuring like so​:

./Configure -Accflags=-DSIG_PENDING_DIE_COUNT=N

where N is 520 or whatever number works for you.

That's useful info, we built a vanilla 5.18.2 perl with the #define
changed to 520, but a configure argument is much neater; my colleagues
used that version on Friday for the next electronic exam, and even with
that limit they hit it once, which impressed me! This is further
evidence that just changing the limit is a bad idea, understanding
is better.

I'll send more info when I have it.

cheers
Duncan

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 19, 2014

From d.white@imperial.ac.uk

I wrote​:

I agree we need to investigate this further. We are working on a
cutdown set of programs that we can share with you guys so that you can
try to reproduce the problem, and we are also investigating vanilla
perl vs ubuntu perl (to try to eliminate at least 1 degree of freedom).

If you are stuck building your own Perl from source for now, you
should be able to get what you need by configuring like so​:

./Configure -Accflags=-DSIG_PENDING_DIE_COUNT=N

where N is 520 or whatever number works for you.

That's useful info, we built a vanilla 5.18.2 perl with the #define
changed to 520, but a configure argument is much neater; my colleagues
used that version on Friday for the next electronic exam, and even with
that limit they hit it once, which impressed me! This is further
evidence that just changing the limit is a bad idea, understanding
is better.

I'll send more info when I have it.

Update​: We have done a number of additional tests and investigations,
we now heavily suspect a change introduced in Perl 5.16

http​://search.cpan.org/dist/perl-5.16.0/pod/perldelta.pod

"system now temporarily blocks the SIGCHLD signal handler, to prevent
the signal handler from stealing the exit status [perl #105700]."

https://rt.perl.org/Public/Bug/Display.html?id=105700

talks of system() returning -1 [which we see, btw] and blocking signals.
In our fertile imaginations, blocking signals while system() runs
(our code, remember, does LOTS of system() calls, mainly scp commands)
could easily lead to large numbers of pending signals!

In more detail​:

- we have built and used several Perl 5.18.2s (Ubuntu pkg build;
  vanilla build from perl.org, 520-pending-signals version of vanilla)
  and they ALL show the bug.

- we have built a vanilla Perl 5.14 and CANNOT get that to show the bug,
  despite intensive testing. We hypothesize that Perl 5.14 DOES NOT
  HAVE the problem, which matches the fact that last year we were using
  Perl 5.14 on Ubuntu 13.04 and never experienced this problem.

- we are just building a vanilla Perl 5.16 and will test it shortly;
  if we are correct about [perl #105700] we expect to see the bug.
  Of course I will confirm/deny that as soon as I can, given it's
  importance!

- we have built several cutdown versions of our code, all of which
  strip out the "site-specific exam system", that we can share with
  you, and we are bisecting between one which shows the bug and one
  does not. These versions cut down the original exam system from
  2500 lines of perl to between 1400-1800 lines of perl. We are still
  trying to cut it down further, but please say if you want us to share
  our code despite it's size.

- finally, I have built yet-another Perl 5.18.2 with a hack patch I
  wrote to write a time series of #pending signals vs epoch time,
  this shows #pending signals varying randomly between 1 and 77,
  largely independent of number of worker threads used. This may
  add evidence to our idea that blocking pending signals while
  system() runs may well lead to much large numbers of pending signals
  than the "120 is ludicrous" original context envisaged.

cheers
Duncan

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 19, 2014

From d.white@imperial.ac.uk

Dear all,

On 19/11/14 13​:34, I (Duncan White) wrote​:

Update​: We have done a number of additional tests and investigations,
we now heavily suspect a change introduced in Perl 5.16

http​://search.cpan.org/dist/perl-5.16.0/pod/perldelta.pod

"system now temporarily blocks the SIGCHLD signal handler, to prevent
the signal handler from stealing the exit status [perl #105700]."

https://rt.perl.org/Public/Bug/Display.html?id=105700

talks of system() returning -1 [which we see, btw] and blocking signals.
In our fertile imaginations, blocking signals while system() runs
(our code, remember, does LOTS of system() calls, mainly scp commands)
could easily lead to large numbers of pending signals!

In more detail​:

- we have built and used several Perl 5.18.2s (Ubuntu pkg build;
vanilla build from perl.org, 520-pending-signals version of vanilla)
and they ALL show the bug.

I should stress here that this means that our earlier belief that the
520 pending signals version "reliably fixes our problem" was wrong;
under greater load ALL 5.18.2 builds experience the problem, whether
the limit is set to 120 or 520.

- we have built a vanilla Perl 5.14 and CANNOT get that to show the bug,
despite intensive testing. We hypothesize that Perl 5.14 DOES NOT
HAVE the problem, which matches the fact that last year we were using
Perl 5.14 on Ubuntu 13.04 and never experienced this problem.

- we are just building a vanilla Perl 5.16 and will test it shortly;
if we are correct about [perl #105700] we expect to see the bug.
Of course I will confirm/deny that as soon as I can, given it's
importance!

I spent much of this afternoon running tests using our vanilla Perl
5.16 under conditions of high load, and I can confirm that Perl 5.16
DOES suffer the bug.

Thus, our current position is that Perl 5.14 does NOT suffer the bug,
while vanilla perl 5.16 and several Perl 5.18 builds *DO* suffer the
bug. We further speculate that

https://rt.perl.org/Public/Bug/Display.html?id=105700

(introduced in Perl 5.16) may be the cause of the changed behaviour.

We will continue to experiment with our cutdown versions, attempting
to derive the smallest possible test program exhibiting the
"too many pending signals" behaviour.

cheers
Duncan

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 19, 2014

From @Leont

On Wed, Nov 19, 2014 at 2​:34 PM, Duncan White <d.white@​imperial.ac.uk>
wrote​:

Update​: We have done a number of additional tests and investigations,
we now heavily suspect a change introduced in Perl 5.16

http​://search.cpan.org/dist/perl-5.16.0/pod/perldelta.pod

"system now temporarily blocks the SIGCHLD signal handler, to prevent
the signal handler from stealing the exit status [perl #105700]."

https://rt.perl.org/Public/Bug/Display.html?id=105700

talks of system() returning -1 [which we see, btw] and blocking signals.
In our fertile imaginations, blocking signals while system() runs
(our code, remember, does LOTS of system() calls, mainly scp commands)
could easily lead to large numbers of pending signals!

If a signal is blocked, it will not be delivered to perl in the first place
until it is unblocked, so it won't lead to signals pending in userland
(it's instead pending in kernel-land). I don't see how that can explain
what you're observing by itself.

Also, shortly after the system() call the delayed signal handler should run
(probably in pp_nextstate or pp_and), this does not sound like a logical
explanation. You could try to compile with that commit reverted and see
what happens though, there may be something non-obvious going on.

Can you «strace -e signal» your program? That should tell you more about
what's going on.

Leon

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 19, 2014

From @craigberry

On Wed, Nov 19, 2014 at 11​:17 AM, Leon Timmermans <fawaka@​gmail.com> wrote​:

If a signal is blocked, it will not be delivered to perl in the first place
until it is unblocked, so it won't lead to signals pending in userland (it's
instead pending in kernel-land). I don't see how that can explain what
you're observing by itself.

I wonder if the mix of signals and threads is involved. We're
blocking and then unblocking SIGCHLD using sigprocmask rather than
pthread_sigprocmask. If you unblock SIGCHLD process-wide and have 100
threads in flight, do they all get the signal? More specifically,
does *each* thread increment its own copy of PL_sig_pending whenever
*any* thread gets a signal? Not sure how many threads the OP has, but
I got the impression it was more than a handful (one per exam taker or
something).

Just thinking out loud and haven't really analyzed anything.

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 19, 2014

From @Leont

On Wed, Nov 19, 2014 at 7​:28 PM, Craig A. Berry <craig.a.berry@​gmail.com>
wrote​:

I wonder if the mix of signals and threads is involved. We're
blocking and then unblocking SIGCHLD using sigprocmask rather than
pthread_sigprocmask. If you unblock SIGCHLD process-wide and have 100
threads in flight, do they all get the signal?

sigprocmask is unspecified in a multithreaded program. Signal masks are per
thread though, not per process. AFAIK on 1​:1 threading implementations
(like Linux) thread_sigmask is sigprocmask; on M​::N implementations
(OpenBSD and old Solaris) it may mask an OS thread instead of a userland
thread (which is not what you want).

More specifically,
does *each* thread increment its own copy of PL_sig_pending whenever
*any* thread gets a signal?

A process targetted signal (which is almost any signal except faults and
pthread_kill) may be delivered to any process, but it will only be
delivered once. Most operating systems have a preference of delivering to
the main thread (it makes background threads in naive programs easier).

Actually, that may well be the issue here. The child threads generate
signals, but the main thread gets them delivered. If it's
blocking/restarting for whatever reason, they will keep pending there even
though the child threads would have been able to handle them.

Not sure how many threads the OP has, but

I got the impression it was more than a handful (one per exam taker or
something).

Threads and signals are a problematic combination. You really don't want to
combine them when you can avoid it.

Leon

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

From d.white@imperial.ac.uk

On 19/11/14 13​:34, I (Duncan White) wrote​:
I spent much of this afternoon running tests using our vanilla Perl
5.16 under conditions of high load, and I can confirm that Perl 5.16
DOES suffer the bug.

Thus, our current position is that Perl 5.14 does NOT suffer the bug,
while vanilla perl 5.16 and several Perl 5.18 builds *DO* suffer the
bug. We further speculate that

https://rt.perl.org/Public/Bug/Display.html?id=105700

(introduced in Perl 5.16) may be the cause of the changed behaviour.

We will continue to experiment with our cutdown versions, attempting
to derive the smallest possible test program exhibiting the
"too many pending signals" behaviour.

Updates of our investigations, plus answering several points Leon and
Craig have made​:

- Leon suggested we run strace -e signal on our code, done, attached.
  Shows lots of signal blocking and lots of sigchlds. Attached,
  although I'm not sure what to deduce from it in this case.

- Leon suggested that signals and threads may interact badly. While
  I appreciate that Perl threads are "deprecated", I should stress
  that my threaded Perl app has worked reliably for >6 years and
  is *still* working reliably when the arbitrary limit of "120
  pending signals" is not breached.

- Craig asked about how many threads I have, yes, typically 40-80,
  one per controlled "slave" PC. All of my recent tests used 80 threads
  and 80 machines, and the probability of crashing does appear to
  increase with number of threads.

- important new info​: Leon suggested that we compile a Perl 5.18.2 with
  patch 105700 removed and report back​: we've done that, and the bug
  DOES NOT APPEAR.

  We've run over 1000 "dump all machines" cycles under conditions of
  high load. Under those conditions every run with the system Perl
  5.18.2 (or any other 5.18.2 build, eg vanilla) hits the "more than
  120 pending signals" error within 100 runs.

  This is substantial evidence that somehow the (sensible) block-
  sigchld-in-system patch introduced in Perl 5.16 by Leon is
  interacting badly with the "more than 120 pending signals"
  patch. Hypothetically, I would say that the block-sigchld-in-system
  patch is "magnifying" or "concentrating" the number of pending
  signals, triggering the "more than 120 pending signals" patch that
  was already there. But that last sentence is speculation.

- To summarise the evidence so far​:

  - we have NOT SEEN the "more than 120 pending signals" bug
  hitting us using any Perl 5.14 build.

  - a vanilla Perl 5.16 build DOES show this bug.

  - only one version of perl 5.18.2 does NOT SHOW this bug - the is
  the above mentioned "with patch 105700 removed".

  - every OTHER version of perl 5.18.2 DOES show this bug.

- The second strand of investigation is cutting down our exam-control
  program so I can share it with you all. I have reduced 2500 lines
  of Perl to ~850 lines so far, ripping out all the "client reconfigures
  itself into exam mode" logic. I attach a tarball of this code
  for you now, it contains a README explaining what I think you'll
  need to test it (a mere 40-80 hosts​:-)). Please ask if anything
  is not clear.

- The final strand of investigation is that I have built a version of
  Perl 5.18.2 with a diagnostic patch to log a time series of numbers
  of pending signals over time, I will now build another version of
  5.18.2 with that patch added AND the block-sigchild-in-system patch
  removed. I will attempt to gather two sets of timeseries data and
  graph them. I **expect** of course that these will show the
  "magnification of pending signals" property - but of course I don't
  know that yet​:-)

My "working conclusion" is still that the 120 pending signals patch is
utterly arbitrary and stupid and (at least in my situation) is
tripping when the Perl interpreter is in no difficulty at all, but that
Perl 5.16 and later have somehow "magnified" this effect horribly.

cheers
Duncan

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

From @Leont

On Fri, Nov 21, 2014 at 4​:33 PM, Duncan White <d.white@​imperial.ac.uk>
wrote​:

- Leon suggested we run strace -e signal on our code, done, attached.
Shows lots of signal blocking and lots of sigchlds. Attached,
although I'm not sure what to deduce from it in this case.

Not much more than confirming what we already suspected :-/

- Leon suggested that signals and threads may interact badly. While
I appreciate that Perl threads are "deprecated", I should stress
that my threaded Perl app has worked reliably for >6 years and
is *still* working reliably when the arbitrary limit of "120
pending signals" is not breached.

I didn't mean to suggest this wasn't a serious issue, we do take
regressions serious.

- important new info​: Leon suggested that we compile a Perl 5.18.2 with
patch 105700 removed and report back​: we've done that, and the bug
DOES NOT APPEAR.

Yes, that is very relevant.

  This is substantial evidence that somehow the (sensible) block-

sigchld-in-system patch introduced in Perl 5.16 by Leon is
interacting badly with the "more than 120 pending signals"
patch. Hypothetically, I would say that the block-sigchld-in-system
patch is "magnifying" or "concentrating" the number of pending
signals, triggering the "more than 120 pending signals" patch that
was already there. But that last sentence is speculation.

Agreed. My hypothesis is that it's because of signals being delivered to
other threads than the one that caused them.

- The second strand of investigation is cutting down our exam-control
program so I can share it with you all. I have reduced 2500 lines
of Perl to ~850 lines so far, ripping out all the "client reconfigures
itself into exam mode" logic. I attach a tarball of this code
for you now, it contains a README explaining what I think you'll
need to test it (a mere 40-80 hosts​:-)). Please ask if anything
is not clear.

I have attached a tiny program that shows exactly this problem. The
parameters may be different on other computers but for me 25​:10 works. It
prints out the number of perl-level signal handlers run, this suggests
signals are getting merged on a Perl level (they are far too low).

- The final strand of investigation is that I have built a version of
Perl 5.18.2 with a diagnostic patch to log a time series of numbers
of pending signals over time, I will now build another version of
5.18.2 with that patch added AND the block-sigchild-in-system patch
removed. I will attempt to gather two sets of timeseries data and
graph them. I **expect** of course that these will show the
"magnification of pending signals" property - but of course I don't
know that yet​:-)

Could be interesting. The count is per thread, so some kind of thread id
with each such message may be useful.

Leon

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

From @Leont

test

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

From d.white@imperial.ac.uk

Dear Leon,

On 21/11/14 17​:44, Leon Timmermans wrote​:

On Fri, Nov 21, 2014 at 4​:33 PM, Duncan White <d.white@​imperial.ac.uk
<mailto​:d.white@​imperial.ac.uk>> wrote​:

\- Leon suggested we run strace \-e signal on our code\, done\, attached\.
  Shows lots of signal blocking and lots of sigchlds\.  Attached\,
  although I'm not sure what to deduce from it in this case\.

Not much more than confirming what we already suspected :-/

sure​:-)

\- Leon suggested that signals and threads may interact badly\.  While
  I appreciate that Perl threads are "deprecated"\, I should stress
  that my threaded Perl app has worked reliably for >6 years and
  is \*still\* working reliably when the arbitrary limit of "120
  pending signals" is not breached\.

I didn't mean to suggest this wasn't a serious issue, we do take
regressions serious.

I wasn't criticising, I just wanted to disspell the "here be dragons"
feeling I got from your hint​:-)

\- important new info&#8203;: Leon suggested that we compile a Perl 5\.18\.2 with
  patch 105700 removed and report back&#8203;: we've done that\, and the bug
  DOES NOT APPEAR\.

Yes, that is very relevant.

  This is substantial evidence that somehow the \(sensible\) block\-
  sigchld\-in\-system patch introduced in Perl 5\.16 by Leon is
  interacting badly with the "more than 120 pending signals"
  patch\.  Hypothetically\, I would say that the block\-sigchld\-in\-system
  patch is "magnifying" or "concentrating" the number of pending
  signals\, triggering the "more than 120 pending signals" patch that
  was already there\.  But that last sentence is speculation\.

Agreed. My hypothesis is that it's because of signals being delivered to
other threads than the one that caused them.

Fascinating! could well be right.

I have attached a tiny program that shows exactly this problem. The
parameters may be different on other computers but for me 25​:10 works.
It prints out the number of perl-level signal handlers run, this
suggests signals are getting merged on a Perl level (they are far too low).

Impressive! now that's a proper test case, showing the problem quickly!
I confirm that "25​:10" and "20​:10" trigger the problem for me with our
system perl.

\- The final strand of investigation is that I have built a version of
  Perl 5\.18\.2 with a diagnostic patch to log a time series of numbers
  of pending signals over time\, I will now build another version of
  5\.18\.2 with that patch added AND the block\-sigchild\-in\-system patch
  removed\.  I will attempt to gather two sets of timeseries data and
  graph them\.  I \*\*expect\*\* of course that these will show the
  "magnification of pending signals" property \- but of course I don't
  know that yet&#8203;:\-\)

Could be interesting. The count is per thread, so some kind of thread id
with each such message may be useful.

Umm, as yet I don't know how (in mg.c) to access the Perl thread id,
if you could tell me that I'll give it a try. I had enough difficulty
realising to use Perlio_printf etc rather than fprintf​:-)

I now have some time series data, the (utterly crap) diagnostic patch
that generates it, two time series runs for about a minute, and the
gnuplot commands to draw a graph, and a postscript form of the graph
(output.ps), all in the attached tarball. I should say that the perl
interpreters with the diag patch are really unreliable, they often hang
when running the "server" version in the earlier tarball; to get the
two timeseries I've plotted, I had to ctrl-c and rerun about 5 times​:-)

But I think the data is sound, and shows the "magnification effect"
brilliantly! see the README in the timeseries tarball for more details.

Leon

Good luck tracking things down, I'm off home now. I'll intermittently
check email over the weekend, and catch up properly on Monday.

cheers
Duncan

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

From @Leont

On Fri, Nov 21, 2014 at 6​:44 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

It prints out the number of perl-level signal handlers run, this suggests
signals are getting merged on a Perl level (they are far too low).

Except that I had already removed that, since it didn't prove to be
useful...

Leon

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

From @craigberry

On Fri, Nov 21, 2014 at 12​:42 PM, Duncan White <d.white@​imperial.ac.uk> wrote​:

Dear Leon,

On 21/11/14 17​:44, Leon Timmermans wrote​:

I have attached a tiny program that shows exactly this problem. The
parameters may be different on other computers but for me 25​:10 works.
It prints out the number of perl-level signal handlers run, this
suggests signals are getting merged on a Perl level (they are far too low).

Impressive! now that's a proper test case, showing the problem quickly!
I confirm that "25​:10" and "20​:10" trigger the problem for me with our
system perl.

I worked my way up to a $count of 500 with 100 threads on both Mac OS
X and VMS and it never fails. Those are two threading implementations
that are completely different from each other and (as far as I know)
from Linux. This was with blead, so I guess it's possible we fixed
something else since 5.18.x, but it seems more likely that Linux
thread handling is what's different. I know very little except what
I've read at <http​://www.linuxprogrammingblog.com/all-about-linux-signals?page=show>.
It does say that in addition to a process-wide signal mask there is a
separate signal mask per thread and that you call pthread_sigmask to
set it. We aren't doing that when we block SIGCHLD around a system()
call, so it might be worth a try.

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 21, 2014

From @Leont

On Fri, Nov 21, 2014 at 7​:42 PM, Duncan White <d.white@​imperial.ac.uk>
wrote​:

I wasn't criticising, I just wanted to disspell the "here be dragons"
feeling I got from your hint​:-)

This whole class of issues is fairly "here be dragons" if you ask me.

Umm, as yet I don't know how (in mg.c) to access the Perl thread id,
if you could tell me that I'll give it a try. I had enough difficulty
realising to use Perlio_printf etc rather than fprintf​:-)

Getting the perl-level tid is a bit tricky I'm afraid, but printing aTHX
with "%p" would already be quite useful. Actually, PL_curinterp == aTHX
(true in the main thread) would likely be all the information we need.

I now have some time series data, the (utterly crap) diagnostic patch

that generates it, two time series runs for about a minute, and the
gnuplot commands to draw a graph, and a postscript form of the graph
(output.ps), all in the attached tarball. I should say that the perl
interpreters with the diag patch are really unreliable, they often hang
when running the "server" version in the earlier tarball; to get the
two timeseries I've plotted, I had to ctrl-c and rerun about 5 times​:-)

You're doing buffered IO in a signal handler, not-crashing 1 out of 5 times
is not a bad score.

Good luck tracking things down, I'm off home now. I'll intermittently
check email over the weekend, and catch up properly on Monday.

Masking the signal in the main thread appears to make the problem go away,
which would confirm my theory about the signal getting overdelivered to the
main thread. That's not a generic solution though.

Leon

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Nov 28, 2014

From @Leont

On Sat, Nov 22, 2014 at 12​:03 AM, Craig A. Berry <craig.a.berry@​gmail.com>
wrote​:

I worked my way up to a $count of 500 with 100 threads on both Mac OS
X and VMS and it never fails. Those are two threading implementations
that are completely different from each other and (as far as I know)
from Linux. This was with blead, so I guess it's possible we fixed
something else since 5.18.x, but it seems more likely that Linux
thread handling is what's different.

If they don't prefer delivery to the main thread, then will not see this
issue (assuming my hypothesis is correct). POSIX leaves that behavior
unspecified.

Actually this might offer a clean way out. The original problem existed on
*BSD and Darwin but not on Linux. If my script is also non-problematic on
*BSD, the solution may be to only mask on bsdish systems.

I know very little except what
I've read at <
http​://www.linuxprogrammingblog.com/all-about-linux-signals?page=show>.
It does say that in addition to a process-wide signal mask there is a
separate signal mask per thread and that you call pthread_sigmask to
set it. We aren't doing that when we block SIGCHLD around a system()
call, so it might be worth a try.

There are both per-process and per-thread signal queues but nothing else is
both. The signal disposition (handler) is per-process, the signal masks are
per-thread. Did I mention here be dragons?

Leon

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Dec 5, 2014

From d.white@imperial.ac.uk

Dear Leon, all,

On 21/11/14 23​:11, Leon Timmermans wrote​:

On Fri, Nov 21, 2014 at 7​:42 PM, Duncan White <d.white@​imperial.ac.uk
<mailto​:d.white@​imperial.ac.uk>> wrote​:

I wasn't criticising\, I just wanted to disspell the "here be dragons"
feeling I got from your hint&#8203;:\-\)

This whole class of issues is fairly "here be dragons" if you ask me.

Fair enough​:-) Guess I was pretty lucky (only 7 years or more) up to
now then​:-)

Getting the perl-level tid is a bit tricky I'm afraid, but printing aTHX
with "%p" would already be quite useful. Actually, PL_curinterp == aTHX
(true in the main thread) would likely be all the information we need.

I'm afraid I didn't have time to try adding this into the diag patch.

I now have some time series data\, the \(utterly crap\) diagnostic patch
that generates it\, two time series runs for about a minute\, and the
gnuplot commands to draw a graph\, and a postscript form of the graph
\(output\.ps \<http&#8203;://output\.ps>\)\, all in the attached tarball\.  I
should say that the perl
interpreters with the diag patch are really unreliable\, they often hang
when running the "server" version in the earlier tarball; to get the
two timeseries I've plotted\, I had to ctrl\-c and rerun about 5 times&#8203;:\-\)

You're doing buffered IO in a signal handler, not-crashing 1 out of 5
times is not a bad score.

Absolutely​:-)

Good luck tracking things down\, I'm off home now\.  I'll intermittently
check email over the weekend\, and catch up properly on Monday\.

Masking the signal in the main thread appears to make the problem go
away, which would confirm my theory about the signal getting
overdelivered to the main thread. That's not a generic solution though.

Leon

I wanted to check - I wasn't really sure whether you guys were needing
anything more from me, I feel I've given you all the info (including
the diag timeseries patch) to reproduce it - with your brilliant short
example making that easy.

I have a couple of small new things to add​: I explained this perl bug
to a friend of mine, we spent some time simplifying your short example
Perl program, I can reproduce the "too many pending signals (120)"
error with the following ridiculously short version, 4 lines long, only
1 thread created​:

#! /usr/bin/env perl
use threads;
$SIG{CHLD} = sub { };
threads->create( sub { system 'sleep 0' for 1..300; } )->join;

On my machine, this reliably crashes almost instantly. I may
play with my two timeseries producing Perl interpreters (one
with the sigchld-blocking-in-system patch in, one without
it) running the above code, because this is so simple we
might learn something new.

Another version of this, only a few lines longer, made the signal
handler increase a count and print it out in the child thread​:

#! /usr/bin/env perl
use threads;
my $count=0;
$SIG{CHLD} = sub { $count++ };
threads->create(
  sub {
  foreach my $i (1..300)
  {
  system 'sleep 0';
  print "count=$count\n";
  }
  }
)->join;
print "$count sigchilds handled\n";

At first, I forgot to make $count "​: shared" (and use threads​::shared),
and thus ended up with two $count's, one in the main program, the other
in the explicitly created thread. Interestingly, printing the count in
the thread's sub shows it gradually increasing, i.e. the signal handler
was being invoked in the explicit thread. After the thread exits (with
final $count between 10 and 30, say), the main program's $count is
still 0, which then increments to 1 (somewhere around the thread
death/join time).

Of course, when I made it : shared, the only change was that the
final value was the final-inner value + 1,i.e. if the last thing
printed inside the explicit thread was 22, then it printed
"23 sigchilds handled" at the end.

I wondered whether this tells us anything new about signal
handling in Perl.

The final thing to say is that I have worked round my production-
system's problem be writing my own Perl my_system() function,
and calling that instead of system(). That's just a temporary
fix to make stuff work for me, of course. I still think you
guys need to diagnose and then fix Perl so others don't experience
this again.
cheers
Duncan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant