Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

t/op/fork.t blocks in external process test on Win32 #11267

Closed
p5pRT opened this issue Apr 18, 2011 · 30 comments
Closed

t/op/fork.t blocks in external process test on Win32 #11267

p5pRT opened this issue Apr 18, 2011 · 30 comments

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Apr 18, 2011

Migrated from rt.perl.org#88840 (status was 'resolved')

Searchable as RT88840$

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 18, 2011

From @tonycoz

Created by @tonycoz

t/op/fork.t blocks running the test code​:

system $^X, "-e", "if (\$pid=fork){kill(9, \$pid)} else {sleep 5}";

This leaves a process running that Ctrl-C does not kill and needs to be
killed with Task Manager.

Simply reproducable with​:

..\perl -I..\lib -le "if ($pid=fork){ kill 9, $pid; } else { sleep 5;}"

Loading a non-trivial module prevents the lockup​:

..\perl -I..\lib -MEncode -le "if ($pid=fork){ kill 9, $pid; } else { sleep 5;}"

Modifying win32.c to include Sleep(1) immediately before the call to
TerminateThread() prevents the lockup. Sleep(0) does not prevent the
problem.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.14.0:

Configured by tony at Mon Apr 18 20:41:14 2011.

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:
  Derived from: d12b49d6e531fdc354980af14eb2efa34756c1d8
  Platform:
    osname=MSWin32, osvers=6.1, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -Od -MD -Zi -DDEBUGGING -DWIN32 -D_CONSOLE -DNO_STRICT -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE  -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO',
    optimize='-Od -MD -Zi -DDEBUGGING',
    cppflags='-DWIN32'
    ccversion='15.00.30729.01', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug  -libpath:"c:\notthere\lib\CORE"  -machine:x86 "/manifestdependency:type='Win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*' publicKeyToken='6595b64144ccf1df' language='*'"'
    libpth=\lib
    libs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
    perllibs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl514.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug  -libpath:"c:\notthere\lib\CORE"  -machine:x86 "/manifestdependency:type='Win32' name='Microsoft.Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*' publicKeyToken='6595b64144ccf1df' language='*'"'

Locally applied patches:
    RC0


@INC for perl 5.14.0:
    lib
    C:/Users/tony/dev/perl/git/perl/lib
    .


Environment for perl 5.14.0:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE;C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN;C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\Tools;C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\Tools\bin;C:\Windows\Microsoft.NET\Framework\;C:\Windows\Microsoft.NET\Framework\\Microsoft .NET Framework 3.5 (Pre-Release Version);C:\Windows\Microsoft.NET\Framework\v2.0.50727;C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\VCPackages;C:\Program Files\Microsoft SDKs\Windows\v7.0\bin;C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\Bin\amd64;C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\vcpackages;C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE;C:\apps\platsdk\win7\Bin\x64;C:\apps\platsdk\win7\Bin;C:\Windows\Microsoft.NET\Framework64\v3.5;C:\Windows\Microsoft.NET\Framework\v3.5;C:\Windows\Microsoft.NET\Framework64\v2.0.50727;C:\Windows\Microsoft.NET\Framework\v2.0.50727;C:\apps\platsdk\win7\Setup;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\MiKTeX 2.8\miktex\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\WIDCOMM\Bluetooth Software\;C:\Program Files\WIDCOMM\Bluetooth Software\syswow64;C:\Program Files (x86)\QuickTime\QTSystem\;C:\apps\git\Git\cmd
    PERL_BADLANG (unset)
    SHELL (unset)

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 2, 2012

From @bulk88

I am using
__________________________________________________________
C​:\>perl -V
Summary of my perl5 (revision 5 version 17 subversion 3 patch blead
2012-08-02.1
5​:28​:40 4276fd3 v5.17.2-193-g4276fd3)
configura
tion​:
  Snapshot of​: 4276fd3
  Platform​:
  osname=MSWin32, osvers=5.2, archname=MSWin32-x64-multi-thread
  uname=''
  config_args='undef'
  hint=recommended, useposix=true, d_sigaction=undef
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cl', ccflags ='-nologo -GF -W3 -Od -MD -Zi -DDEBUGGING
-fp​:precise -DWIN
32 -D_CONSOLE -DNO_STRICT -DWIN64 -DCONSERVATIVE
-D_CRT_SECURE_NO_DEPRECATE -D_C
RT_NONSTDC_NO_DEPRECATE -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT
-DPERL_
IMPLICIT_SYS -DUSE_PERLIO',
  optimize='-Od -MD -Zi -DDEBUGGING -fp​:precise',
  cppflags='-DWIN32'
  ccversion='15.00.30729.01', gccversion='', gccosandvers=''
  intsize=4, longsize=4, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
  ivtype='__int64', ivsize=8, nvtype='double', nvsize=8,
Off_t='__int64', lsee
ksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='link', ldflags ='-nologo -nodefaultlib -debug
-libpath​:"c​:\p517\lib\COR
E" -machine​:AMD64 "/manifestdependency​:type='Win32'
name='Microsoft.Windows.Com
mon-Controls' version='6.0.0.0' processorArchitecture='*'
publicKeyToken='6595b6
4144ccf1df' language='*'"'
  libpth=\lib
  libs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.l
ib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib
uuid.lib ws2_32
.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib
msvcrt.
lib
  perllibs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg
32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib
uuid.lib ws
2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib
comctl32.lib msv
crt.lib
  libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl517.lib
  gnulibc_version=''
  Dynamic Linking​:
  dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug
-libpath​:"c​:\p
517\lib\CORE" -machine​:AMD64 "/manifestdependency​:type='Win32'
name='Microsoft.
Windows.Common-Controls' version='6.0.0.0' processorArchitecture='*'
publicKeyTo
ken='6595b64144ccf1df' language='*'"'

Characteristics of this binary (from libperl)​:
  Compile-time options​: DEBUGGING HAS_TIMES HAVE_INTERP_INTERN MULTIPLICITY
  PERLIO_LAYERS PERL_DONT_CREATE_GVSV
  PERL_IMPLICIT_CONTEXT PERL_IMPLICIT_SYS
  PERL_MALLOC_WRAP PERL_PRESERVE_IVUV
  PERL_TRACK_MEMPOOL USE_64_BIT_INT USE_ITHREADS
  USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE
  USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_PERLIO
  USE_PERL_ATOF
  Built under MSWin32
  Compiled at Aug 2 2012 17​:01​:32
  @​INC​:
  C​:/p517/site/lib
  C​:/p517/lib
  .

C​:\>
______________________________________________________________
VS 2008 and Server 2003 x64, and a x64 perl.

The /op/fork.t freeze is this being run "perl -e "if ($pid=fork){kill(9,
$pid)} else {sleep 5}"".
The callstack from the only thread at that point in the process is
___________________________________________________
  ntdll.dll!ZwWaitForSingleObject() + 0xa bytes
  ntdll.dll!RtlpWaitOnCriticalSection() - 0x1aa bytes
  ntdll.dll!LdrUnloadDll() - 0x1d9a0 bytes
  kernel32.dll!FreeLibrary() + 0x71 bytes

perl517.dll!VMem​::~VMem() Line 159 C++
  perl517.dll!VMem​::`scalar deleting destructor'() + 0x17 bytes C++
  perl517.dll!VMem​::Release() Line 268 + 0x2b bytes C++
  perl517.dll!CPerlHost​::~CPerlHost() Line 2081 C++
  perl517.dll!CPerlHost​::`scalar deleting destructor'() + 0x17 bytes C++
  perl517.dll!win32_delete_internal_host(void * h=0x0000000000346e60)
Line 202 + 0x2b bytes C++
  perl517.dll!perl_free(interpreter * my_perl=0x0000000001aa6a08)
Line 1335 C
  perl517.dll!RunPerl(int argc=3, char * * argv=0x0000000000343d90, char
* * env=0x00000000003441b0) Line 277 C++
  perl.exe!main(int argc=3, char * * argv=0x0000000000343d90, char * *
env=0x0000000000344820) Line 24 C
  perl.exe!__tmainCRTStartup() Line 586 + 0x19 bytes C
  kernel32.dll!BaseProcessStart() + 0x2c bytes
__________________________________________________
This is the parent "process" in ithreads, the child thread would have a
different callstack begining.

In vmem.h at line 158.
_____________________________________________
VMem​::~VMem(void)
{
#ifdef _USE_LINKED_LIST
  while (m_Dummy.pNext != &m_Dummy) {
  Free(m_Dummy.pNext+1);
  }
#endif
  if (m_hLib)
  FreeLibrary(m_hLib);
  DeleteCriticalSection(&m_cs);
}
____________________________________________

m_hLib is 0x000007ff7fc00000 which is C​:\WINDOWS\system32\msvcrt.dll
according to Process Explorer (HMODULEs are pointers to a memory mapped
DLL PE file header). The problem with vmem.h is that it calls msvcrt.dll
instead of being hard coded to whatever the static linked malloc or
static linked CRT is, so 2 CRTs wind up in the process space always
unless its compiled with Platform SDK or VC 6 (ActivePerl is made this
way) which doesnt use VS specific CRTs but msvcrt.dll for a c std lib.
This is an optimization issue but is not the cause of the deadlock,
since vmem.h and msvcrt.dll have been around for almost a decade ( added
in
http​://perl5.git.perl.org/perl.git/commit/f57e8d3bb1a36000e1951a4d2d5b0104b39bde4b
)

So I got private symbol global variable "LdrpLoaderLock" (google it)
from ntdll and dumped it.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 2, 2012

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 2, 2012

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 2, 2012

From @bulk88

The only thread in the process is PID 45692, which is parent thread
based on callstack. 0x000000000000815c is
33116 in dec. So I presume PID 33116 was the child thread.

So I put a breakpoint on TerminateThread in win32_kill in win32.c and
restart the perl process
____________________________________________________
  case 9​:
  /* kill -9 style un-graceful exit */
  if (TerminateThread(hProcess, sig)) {
  /* Allow the scheduler to finish cleaning up the
other thread.
  * Otherwise, if we ExitProcess() before another
context switch
  * happens we will end up with a process exit code
of "sig" instead
  * of our own exit status.
  * See also​:
https://rt.cpan.org/Ticket/Display.html?id=66016#txn-908976
  */
  Sleep(0);
  remove_dead_pseudo_process(child);
____________________________________________________

Ldr critical section is owned by the child thread (the PID in the
critical section is the child thread at the top tab in VS) at the point
where we will call TerminateThread. Not 1 Perl machine opcode executed
yet in the child thread. The NT DLL loader is calling the PE header
entry point on kernel32.dll (function BaseDllInitialize in kernel32.dll)
and kernel32.dll the dll is setting itself up before running any
!!!win32!!! user mode code.

For more on the behavior of the DLL loader lock, read this by MS
http​://download.microsoft.com/download/a/f/7/af7777e5-7dcd-4800-8a0a-b18336565f5b/dll_bestprac.doc
.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 2, 2012

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 2, 2012

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 3, 2012

From @bulk88

The simplest (it can get much more complicated, I discuss later) fix is
either to wait for the child interp in the child thread to run at
PerlProcFork/CreateThread time (bad, a context switch which 99% of the
isn't needed), or at win32_kill time (good, I explain later).
CreateThread on Windows is lazy, you get back a handle even if the
thread never gets scheduled (
http​://msdn.microsoft.com/en-us/library/ms682453%28VS.85%29.aspx ) or
access vio function pointer to start the thread. So I think for
win32_kill, win32_kill needs to get the PerlInterpreter * of the child
interp, and check if the child interp knows its thread id yet, relying
on this line
http​://perl5.git.perl.org/perl.git/blob/HEAD​:/win32/perlhost.h#l1711 .
At the TerminateThread in win32_kill do something like

if(! child_perl->Isys_intern->psuedo_id){
  Sleep(0);
  if(! child_perl->Isys_intern->psuedo_id){
  Sleep(1); //maybe some sync IO or mutex collision in thread
attach in some dll entry point, how many millsec do we wait?
  if(! child_perl->Isys_intern->psuedo_id){
  croak("panic​: child interp never scheduled");
  }
  }
}

The problem is, I dont think that the PerlInterpreter * of child interp
is stored anywhere in the parent interp, maybe I just can't find where
it is stored (newbie). Anyone know how to get the PerlInterpreter * of
the child interp from the parent interp's PerlInterpreter struct or
somehow else?

Reason it is better to check if the child interp was ever scheduled at
kill time than create time is most often the parent will go do work then
block on IO or wait for a child interp or run bytecode till its quantum
expires, or hits a malloc mutex, or something and the child thread is
scheduled enough times to go through the DLL thread initialization and
reach win32_start_child function. I slightly screwed up the description
of DLL entry points above, they run on process start, and on thread
start (usually, see
http​://msdn.microsoft.com/en-us/library/windows/desktop/ms682579%28v=vs.85%29.aspx
usually called from process attach of dll for dll that doesn't care
about threads) and 2 other cases, see
http​://msdn.microsoft.com/en-us/library/windows/desktop/ms682583%28v=vs.85%29.aspx
for the 4 times dll entry point is called/runs.

A more complicated alternative/higher overhead is creating a 1 time use
event handle
http​://msdn.microsoft.com/en-us/library/windows/desktop/ms682655%28v=vs.85%29.aspx
that win32_start_child activates and wait on it in parent interp at kill
time with a timeout. What the timeout should be IDK.

A more complicated solution is to try to obtain the dll loader lock
using NT Native API,
http​://www.geoffchappell.com/studies/windows/win32/ntdll/api/ldrapi/lockloaderlock.htm
and not call TerminateThread if parent interp can't get the lock, the
solution is not WinCE compatible (is this deprecated/near dead platform
for Perl or not?), and not DOS Windows compatible (code removed in 5.15
(I think) for DOS Windows by Jan Dubois). At one time Perl did use
Native API for NT 4 support but Jan removed ntdll from Perl in
http​://perl5.git.perl.org/perl.git/commitdiff/8cbe99e5b6fe99a6bc17c0b0cee249bac3565da4
.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 3, 2012

From @bulk88

A half baked solution is to not have perlhost.h load a 2nd C std Lib
into the perl OS process, and therefore never need to call FreeLibrary
in ~Vmem destroyer which will freeze the parent interp. Why the loader
lock never times out with a software exception (similar to access vio
GUI wise) IDK enough about windows to say. Any XS library that does a
FreeLibrary or a LoadLibrary, either explicitly or secretly in a pure
C/C++ DLL it wraps, probably even Dynaloader bootstrap will freeze
indefinitely, aslong as you Perl Lang kill 9ed a child fork "process"
earlier in runtime fast enough to never run win32_start_child.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 3, 2012

From @bulk88

I am done, I am waiting for comments from the rest of P5P now.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 3, 2012

From @bulk88

I've been wondering, has this test always failed? or it only fails on
SMP >1 CPU Windows (fails for me on 2 core and 8 core PCs)? Or does it
only fail on fast CPUs? or on Server 2003 (my machine) or NT 6 and not
XP (does for me) or 2000 because kernel scheduler tweaks by MS over the
years. Does it only fail on Visual Studio compiled Perl that uses a VS
specific CRT instead of msvcrt.dll (ActivePerl's C lib) (I blindly
guess, because a FreeLibrary that lowers the reference count but doesn't
free the DLL does so without using the loader lock maybe???). How was
5.16.0 released with this failing supposedly since 5.14 according to
tony cook?

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 5, 2012

From @tonycoz

On Fri, Aug 03, 2012 at 12​:59​:18PM -0700, bulk 88 via RT wrote​:

I've been wondering, has this test always failed? or it only fails on
SMP >1 CPU Windows (fails for me on 2 core and 8 core PCs)? Or does it
only fail on fast CPUs? or on Server 2003 (my machine) or NT 6 and not
XP (does for me) or 2000 because kernel scheduler tweaks by MS over the
years.

This is my suspicion for the cause.

Does it only fail on Visual Studio compiled Perl that uses a VS
specific CRT instead of msvcrt.dll (ActivePerl's C lib) (I blindly
guess, because a FreeLibrary that lowers the reference count but doesn't
free the DLL does so without using the loader lock maybe???). How was
5.16.0 released with this failing supposedly since 5.14 according to
tony cook?

Because no-one with sufficient Win32 knowledge cares enough about
fixing the problems.

Win32 isn't the only platform where blead fails, and has failed for a
long time.

Tony

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 5, 2012

From @tonycoz

On Thu, Aug 02, 2012 at 06​:03​:33PM -0700, bulk 88 via RT wrote​:

A half baked solution is to not have perlhost.h load a 2nd C std Lib
into the perl OS process, and therefore never need to call FreeLibrary
in ~Vmem destroyer which will freeze the parent interp. Why the loader
lock never times out with a software exception (similar to access vio
GUI wise) IDK enough about windows to say. Any XS library that does a
FreeLibrary or a LoadLibrary, either explicitly or secretly in a pure
C/C++ DLL it wraps, probably even Dynaloader bootstrap will freeze
indefinitely, aslong as you Perl Lang kill 9ed a child fork "process"
earlier in runtime fast enough to never run win32_start_child.

The problem is other libraries (eg. any XS module) loads DLLs at
runtime, which means there's always the possibility of the loader lock
causing a problem.

It doesn't need to be an early kill to cause the lockup.

I can see ways to fix this specific instance of the bug - eg. have
win32_kill() wait a little for the child hwnd to be created (see the
default case in the win32_kill() thread kill code.), but it won't
protect against the more general problem that TerminateThread() isn't
a safe function to call.

Tony

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 5, 2012

From @nwc10

On Sun, Aug 05, 2012 at 05​:09​:20PM +1000, Tony Cook wrote​:

On Fri, Aug 03, 2012 at 12​:59​:18PM -0700, bulk 88 via RT wrote​:

Does it only fail on Visual Studio compiled Perl that uses a VS
specific CRT instead of msvcrt.dll (ActivePerl's C lib) (I blindly
guess, because a FreeLibrary that lowers the reference count but doesn't
free the DLL does so without using the loader lock maybe???). How was
5.16.0 released with this failing supposedly since 5.14 according to
tony cook?

Because it's not a regression, Ricardo has no way to compel anyone to fix
it, and if the choice is between delaying a release indefinitely and
making a release with known problems, releasing with known problems is the
lesser of two evils.

Because no-one with sufficient Win32 knowledge cares enough about
fixing the problems.

There actually aren't very many people working on the perl core.
(I'd describe it as about 4 and 4 halves worth of very active contributors,
and it's been this way for years. This surprises people. Ohloh has graphs,
eg https://www.ohloh.net/p/perl/contributors/summary )

Even out into the "long tail" there is very little knowledge of Win32
specifics. As there's also plenty of other stuff that needs doing, pretty
much everyone elects to do things they are able to do well, as that gets
the best return soonest. So Win32 doesn't get worked on that way.

And as there's no hiring organisation in "charge" of Perl 5, it means that
it doesn't get worked on the "other" way, by paying people to work on the
things that no volunteer volunteers to work on.

Win32 isn't the only platform where blead fails, and has failed for a
long time.

And whilst we'd love it to be otherwise, it's really not obvious how to
get from where we are to where we want to be, short of hoping that someone
(or someones) volunteers to start digging into fixing it.

There's no hostility to Win32 (or any other platform). It's just that there
are no active core developers who use it as a primary development platform.

Nicholas Clark

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 6, 2012

From @jandubois

On Sun, 05 Aug 2012, Tony Cook wrote​:

On Fri, Aug 03, 2012 at 12​:59​:18PM -0700, bulk 88 via RT wrote​:

I've been wondering, has this test always failed? or it only fails on
SMP >1 CPU Windows (fails for me on 2 core and 8 core PCs)? Or does it
only fail on fast CPUs? or on Server 2003 (my machine) or NT 6 and not
XP (does for me) or 2000 because kernel scheduler tweaks by MS over the
years.

This is my suspicion for the cause.

I suspect it also doesn't happen when you link Perl against MSVCRT.dll
because LoadLibrary("MSVCRT.dll") then just increments the reference
count and won't hold the loader lock for very long. Both VC6 and MinGW
builds use MSVCRT.

Cheers,
-Jan

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 6, 2012

From @jandubois

On Sun, 05 Aug 2012, Tony Cook wrote​:

On Thu, Aug 02, 2012 at 06​:03​:33PM -0700, bulk 88 via RT wrote​:

A half baked solution is to not have perlhost.h load a 2nd C std Lib
into the perl OS process, and therefore never need to call FreeLibrary
in ~Vmem destroyer which will freeze the parent interp. Why the loader
lock never times out with a software exception (similar to access vio
GUI wise) IDK enough about windows to say. Any XS library that does a
FreeLibrary or a LoadLibrary, either explicitly or secretly in a pure
C/C++ DLL it wraps, probably even Dynaloader bootstrap will freeze
indefinitely, aslong as you Perl Lang kill 9ed a child fork "process"
earlier in runtime fast enough to never run win32_start_child.

I wonder if there isn't a simpler solution​: Just add

#undef malloc
#undef free
#undef realloc

after all the Perl headers have been #included, and then later
assign

  m_pfree = (LPFREE)free;

etc and get rid of the whole LoadLibrary/FreeLibrary stuff. That way VMem
will just use the current CRT versions of these functions. I suspect the
whole LoadLibrary/FreeLibrary dance is only done because the Perl headers
redefine malloc to Perl_malloc, and VMem needs to provide the implementation
to Perl_malloc, so can't use the redefined malloc() functions. Just getting
rid of the wrapper macros should solve the issue though.

There are more chances for code cleanup, but someone who observes the
problem should try this "fix" to see if it works first. :)

[...] It won't
protect against the more general problem that TerminateThread() isn't
a safe function to call.

Yes, the whole fork() emulation stuff is rather brittle and can never cease
to be "experimental". I wish we could just get rid of it, but CPAN modules
seem to rely on it to use forking regression tests on Windows, so I guess it
is at least marginally useful.

Cheers,
-Jan

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 7, 2012

From @bulk88

Another more rare way of deadlocking, child thread was killed while
holding malloc/heap lock, on fast kill 9 on psuedo fork process test.
_______________________________________________________
  ntdll.dll!_KiFastSystemCallRet@​0()
  ntdll.dll!_ZwWaitForSingleObject@​12() + 0xc
  ntdll.dll!_RtlpWaitForCriticalSection@​4() + 0x8c
  ntdll.dll!_RtlEnterCriticalSection@​4() + 0x46
  ntdll.dll!_RtlDebugFreeHeap@​12() + 0x6a
  ntdll.dll!_RtlFreeHeapSlowly@​12() + 0x246cf
  ntdll.dll!_RtlFreeHeap@​12() + 0x17646
  msvcrt.dll!_free() + 0xc3

perl510.dll!VMem​::Free(void * pMem=0x0183780c) Line 230 C++
  perl510.dll!PerlMemFree(IPerlMem * piPerl=0x002342a4, void *
ptr=0x0183780c) Line 310 C++
  perl510.dll!Perl_safesysfree(void * where=0x0183780c) Line 250 + 0xe C
  perl510.dll!Perl_sv_clear(interpreter * my_perl=0x00235ffc, sv *
sv=0x00000006) Line 5260 C
  perl510.dll!Perl_sv_free2(interpreter * my_perl=0x00235ffc, sv *
sv=0x0182bee4) Line 5367 C
  perl510.dll!Perl_sv_free(interpreter * my_perl=0x00235ffc, sv *
sv=0x0182bee4) Line 5345 + 0xa C
  perl510.dll!Perl_av_undef(interpreter * my_perl=0x00235ffc, av *
av=0x0023ad5c) Line 485 + 0xd C
  perl510.dll!Perl_sv_clear(interpreter * my_perl=0x0183780c, sv *
sv=0x0140fa94) Line 5191 + 0x7 C
  perl510.dll!Perl_sv_2pv_flags(interpreter * my_perl=0x00000ff0, sv *
sv=0x00081414, unsigned int * lp=0x0000007c, long flags=4080) Line 2776 C
  0000007b()
_________________________________________________
Symbols files werent perfect for this run but it shows of the point of
deadlock on heap lock. I will have more comments in the next day or 2 on
this bug.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 7, 2012

From @bulk88

https://rt.cpan.org/Ticket/Display.html?id=66016 one of these is a
duplicate of the other, or atleast they are related. Someone link them
in RT. I only noticed the older bug just now.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 7, 2012

From @bulk88

On Mon Aug 06 21​:15​:09 2012, bulk88 wrote​:

https://rt.cpan.org/Ticket/Display.html?id=66016 one of these is a
duplicate of the other, or atleast they are related. Someone link them
in RT. I only noticed the older bug just now.
NVM. Different RTs.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 9, 2012

From @bulk88

Wrote up a patch. fork.t/"fast kill 9 after fork" stopped deadlocking
for me. I wonder if the 5 ms delay is enough (I really dont know). I
came up with the 5 ms delay from the previous code that did 5 "Sleep(0)"s.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 9, 2012

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 10, 2012

From @bulk88

On Wed Aug 08 17​:59​:27 2012, bulk88 wrote​:

Wrote up a patch. fork.t/"fast kill 9 after fork" stopped deadlocking
for me. I wonder if the 5 ms delay is enough (I really dont know). I
came up with the 5 ms delay from the previous code that did 5 "Sleep(0)"s.

My last post made it to Perl RT but not P5P. This one hopefully will.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 10, 2012

From @tonycoz

On Wed, Aug 08, 2012 at 05​:59​:28PM -0700, bulk 88 via RT wrote​:

Wrote up a patch. fork.t/"fast kill 9 after fork" stopped deadlocking
for me. I wonder if the 5 ms delay is enough (I really dont know). I
came up with the 5 ms delay from the previous code that did 5 "Sleep(0)"s.

I tried this and ran t/op/fork.t in a loop.

First time it deadlocked after 7 runs, second it deadlocked after 27
runs.

I increased delay to 10 ms and it successfully ran the test 500 times.

Also, please use git format-patch to produce patches.

Tony

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 10, 2012

From @bulk88

On Fri Aug 10 00​:33​:07 2012, tonyc wrote​:

I tried this and ran t/op/fork.t in a loop.

First time it deadlocked after 7 runs, second it deadlocked after 27
runs.

I increased delay to 10 ms and it successfully ran the test 500 times.

Also, please use git format-patch to produce patches.

Tony

I wrote a script
_______________________________
for(0..1000) {
  system $^X, "-e", "if (\$pid=fork){sleep 1;kill(9, \$pid)} else
{sleep 5}";
}
print "done\n";
_______________________________
Ran it twice on a DEBUGGING 5.17 with my patch, each one succeeded, no
deadlock and no timeout message (this machine did previously deadlock as
I described above). Did it deadlock (no message) for you or it hit the
croak I put in? Are you using a DEBUGGING perl or not (according to
smokes, it only affects DEBUGGING builds,
http​://www.nntp.perl.org/group/perl.daily-build.reports/2012/08/msg125162.html
)? Did you change the ms time to 10 (so 10 x Sleep(1)s happened), or you
put in a Sleep(10)?

If it deadlocked for you, I am very interested in that, since my patch
is supposed to die or succeed, not deadlock. The only thing I can think
of, unless you can produce a callstack that shows otherwise (that would
be greatly appreciated), you are probably deadlocking on the heap
critical section (but how timing wise? child interp is supposed to be in
a sleep 5;, is it really taking that long from the hwnd message send,
thus unlocking get_hwnd_delay to runops calling the sleep opcode
function?), since the DLL loader lock is guaranteed to not be on in the
child interp unless your doing XS. In child interp's VMem, the
LoadLibrary on msvcrt.dll is done in the parent interp thread and it
does use the loader lock even though msvcrt.dll is already in the
process from the parent interp's VMem. Why MS uses the loader lock on a
refcount increase only LoadLibrary IDK.
____________________________________________________________

ntdll.dll!RtlEnterCriticalSection() + 0xa bytes
  ntdll.dll!LdrLockLoaderLock() + 0x78 bytes
  ntdll.dll!LdrLoadDll() + 0xeb bytes
  kernel32.dll!LoadLibraryExW() + 0x120 bytes
  kernel32.dll!LoadLibraryA() + 0x86 bytes
  perl517.dll!VMem​::VMem() Line 142 + 0xd bytes C++
  perl517.dll!CPerlHost​::CPerlHost() Line 1949 + 0x21 bytes C++
  perl517.dll!perl_alloc() Line 179 + 0x21 bytes C++
  perl517.dll!RunPerl(int argc=0x00000003, char * *
argv=0x0000000000343d90, char * * env=0x0000000000344820) Line 248 +
0x5 bytes C++
  perl.exe!main(int argc=0x00000003, char * * argv=0x0000000000343d90,
char * * env=0x0000000000344820) Line 24 C
  perl.exe!__tmainCRTStartup() Line 586 + 0x19 bytes C
  kernel32.dll!BaseProcessStart() + 0x2c bytes

________________________________________________________
The FreeLibrary of the child interp's VMem happens in the child thread,
unlike the LoadLibrary of the child interp (" perl -e "if
($pid=fork){waitpid($pid,0);} else {sleep 1;}"") but this code is never
reached in a early kill 9 after fork scenario. I stepped through the
FreeLibrary asm, it does not get/use the loader lock unless refcount
reaches 0 and instead uses inlined Interlocked* asm ops to decrease
refcount. So even if get_hwnd_delay took 5 clock seconds (a better
question is why? Perl's fault or user's fault?) (the Perl sleep period
in the child fork), since FreeLibrary doesn't get the loader lock, a
TerminateThread wouldn't deadlock there, unless, TerminateThread was
called after the win32_start_child "exited" and was in DLL_THREAD_DETACH
which will have the loader lock. I haven't seen this scenario happen on
my machine.
________________________________________________________

kernel32.dll!FreeLibrary()
  perl517.dll!VMem​::~VMem() Line 159 C++
  perl517.dll!VMem​::`scalar deleting destructor'() + 0x17 bytes C++
  perl517.dll!VMem​::Release() Line 268 + 0x2b bytes C++
  perl517.dll!CPerlHost​::~CPerlHost() Line 2081 C++
  perl517.dll!CPerlHost​::`scalar deleting destructor'() + 0x17 bytes C++
  perl517.dll!win32_delete_internal_host(void * h=0x000000000034aae0)
Line 202 + 0x2b bytes C++
  perl517.dll!perl_free(interpreter * my_perl=0x0000000001af8888)
Line 1335 C
  perl517.dll!win32_start_child(void * arg=0x0000000001af8888) Line
1804 C++
  kernel32.dll!BaseThreadStart() + 0x3a bytes
________________________________________________________

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 10, 2012

From @bulk88

0001-fix-RT-88840-don-t-terminate-a-child-fork-psuedo-pro.patch
From 2d681a6d132537b5febb79b96d5169b822b510cc Mon Sep 17 00:00:00 2001
From: Daniel Dragan <bulk88@hotmail.com>
Date: Wed, 8 Aug 2012 20:46:31 -0400
Subject: [PATCH] fix RT#88840, don't terminate a child fork psuedo process in
 DLL Loader Lock

TerminateThread will terminate a thread but leaks all resources
of that thread, and all locks will never be released, as documented in MSDN.
There is no alternative to locks not being released that I see, but atleast
-e "if ($pid=fork){kill(9,$pid)} else {sleep 5}"
in fork.t won't deadlock with this patch since win32_start_child be reached before
TerminateThread happens. The 5 ms timeout can be increased if problems
arise in the future. The HWND of the child is delivered by win32_start_child
very early, before any perl bytecode is executed, therefore the delay is
keeping in spirit with a kill 9. In any case, if the child thread fails
to schedule, (a DllMain in DLL_THREAD_ATTACH of some DLL in the process
deadlocks or does very long (over 5 ms right now) sync IO), the parent interp
will bail out.
---
 win32/win32.c |   88 ++++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 60 insertions(+), 28 deletions(-)

diff --git a/win32/win32.c b/win32/win32.c
index 211ca6f..54159ca 100644
--- a/win32/win32.c
+++ b/win32/win32.c
@@ -1271,6 +1271,46 @@ my_kill(int pid, int sig)
     return retval;
 }
 
+
+
+#ifdef USE_ITHREADS
+/*
+    will get a child psuedo process hwnd, with retrying and sleeping
+    success is hwnd != INVALID_HANDLE_VALUE, so NULL HWND can be returned
+    ms is milliseconds to sleep/tries, each try is 1 millisec, fatally
+    errors if child psuedo process doesn't schedule and deliver a HWND in the
+    time period specified, 0 milliseconds causes only Sleep(0) to be used
+    with "no" OS delay being given to the calling thread, 0 msis not recommended
+*/
+static HWND
+get_hwnd_delay(pTHX, long child, DWORD ms){
+    HWND hwnd = w32_pseudo_child_message_hwnds[child];
+/* pseudo-process has not yet properly initialized if hwnd isn't set */
+    if (hwnd != INVALID_HANDLE_VALUE) return hwnd;
+/*fast sleep, on some NT Kernels/systems, a Sleep(0) won't deschedule a
+thread 100% of the time since threads are sticked to a CPU for NUMA
+and caching reasons, and the child thread was stickied to a different CPU
+therefore there is no workload on that CPU, and Sleep(0) returns control
+without yielding the time slot
+https://rt.perl.org/rt3/Ticket/Display.html?id=88840
+*/
+    Sleep(0);
+    win32_async_check(aTHX);
+	hwnd = w32_pseudo_child_message_hwnds[child];
+    if (hwnd != INVALID_HANDLE_VALUE) return hwnd;
+    {
+        int count = 0;
+        while (count++ < ms){ /*ms=0 no Sleep(1),just fail by now*/
+            Sleep(1);
+            win32_async_check(aTHX);
+            hwnd = w32_pseudo_child_message_hwnds[child];
+            if (hwnd != INVALID_HANDLE_VALUE) return hwnd;                        
+        }
+    }
+    Perl_croak(aTHX_ "panic: child psuedo process was never scheduled");
+}
+#endif
+
 DllExport int
 win32_kill(int pid, int sig)
 {
@@ -1281,15 +1321,16 @@ win32_kill(int pid, int sig)
 	/* it is a pseudo-forked child */
 	child = find_pseudo_pid(-pid);
 	if (child >= 0) {
-            HWND hwnd = w32_pseudo_child_message_hwnds[child];
 	    HANDLE hProcess = w32_pseudo_child_handles[child];
 	    switch (sig) {
 	    case 0:
 		/* "Does process exist?" use of kill */
 		return 0;
 
-	    case 9:
+	    case 9: {
                 /* kill -9 style un-graceful exit */
+/*do a wait to make sure child starts and isnt in DLL Loader Lock*/
+            HWND hwnd = get_hwnd_delay(aTHX, child, 5);/*XXX change delay*/
 	    	if (TerminateThread(hProcess, sig)) {
                     /* Allow the scheduler to finish cleaning up the other thread.
                      * Otherwise, if we ExitProcess() before another context switch
@@ -1301,36 +1342,27 @@ win32_kill(int pid, int sig)
 		    remove_dead_pseudo_process(child);
 		    return 0;
 	    	}
+	    }
 		break;
 
 	    default: {
-                int count = 0;
-                /* pseudo-process has not yet properly initialized if hwnd isn't set */
-                while (hwnd == INVALID_HANDLE_VALUE && count < 5) {
-                    /* Yield and wait for the other thread to send us its message_hwnd */
-                    Sleep(0);
-                    win32_async_check(aTHX);
-		    hwnd = w32_pseudo_child_message_hwnds[child];
-                    ++count;
-                }
-                if (hwnd != INVALID_HANDLE_VALUE) {
-                    /* We fake signals to pseudo-processes using Win32
-                     * message queue.  In Win9X the pids are negative already. */
-                    if ((hwnd != NULL && PostMessage(hwnd, WM_USER_KILL, sig, 0)) ||
-                        PostThreadMessage(-pid, WM_USER_KILL, sig, 0))
-                    {
-                        /* Don't wait for child process to terminate after we send a SIGTERM
-                         * because the child may be blocked in a system call and never receive
-                         * the signal.
-                         */
-                        if (sig == SIGTERM) {
-                            Sleep(0);
-                            w32_pseudo_child_sigterm[child] = 1;
-                        }
-                        /* It might be us ... */
-                        PERL_ASYNC_CHECK();
-                        return 0;
+                HWND hwnd = get_hwnd_delay(aTHX, child, 5);/*XXX change delay*/
+                /* We fake signals to pseudo-processes using Win32
+                 * message queue.  In Win9X the pids are negative already. */
+                if ((hwnd != NULL && PostMessage(hwnd, WM_USER_KILL, sig, 0)) ||
+                    PostThreadMessage(-pid, WM_USER_KILL, sig, 0))
+                {
+                    /* Don't wait for child process to terminate after we send a SIGTERM
+                     * because the child may be blocked in a system call and never receive
+                     * the signal.
+                     */
+                    if (sig == SIGTERM) {
+                        Sleep(0);
+                        w32_pseudo_child_sigterm[child] = 1;
                     }
+                    /* It might be us ... */
+                    PERL_ASYNC_CHECK();
+                    return 0;
                 }
 		break;
             }
-- 
1.7.9.msysgit.0

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 12, 2012

From @tonycoz

On Fri, Aug 10, 2012 at 09​:42​:22AM -0700, bulk 88 via RT wrote​:

On Fri Aug 10 00​:33​:07 2012, tonyc wrote​:

I tried this and ran t/op/fork.t in a loop.

First time it deadlocked after 7 runs, second it deadlocked after 27
runs.

I increased delay to 10 ms and it successfully ran the test 500 times.

Also, please use git format-patch to produce patches.

Tony

I wrote a script

If it deadlocked for you, I am very interested in that, since my patch
is supposed to die or succeed, not deadlock. The only thing I can think
of, unless you can produce a callstack that shows otherwise (that would
be greatly appreciated), you are probably deadlocking on the heap
critical section (but how timing wise? child interp is supposed to be in
a sleep 5;, is it really taking that long from the hwnd message send,
thus unlocking get_hwnd_delay to runops calling the sleep opcode
function?), since the DLL loader lock is guaranteed to not be on in the
child interp unless your doing XS. In child interp's VMem, the
LoadLibrary on msvcrt.dll is done in the parent interp thread and it
does use the loader lock even though msvcrt.dll is already in the
process from the parent interp's VMem. Why MS uses the loader lock on a
refcount increase only LoadLibrary IDK.

I rebuilt from scratch and ran my test again, and couldn't reproduce
the failures.

Sorry for the noise.

So I'm happy with this change.

Tony

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 14, 2012

From @bulk88

Can someone apply the patch to perl git?

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 14, 2012

From @steve-m-hay

bulk 88 via RT wrote on 2012-08-14​:

Can someone apply the patch to perl git?

---
via perlbug​: queue​: perl5 status​: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=88840

Apologies for the delay. I will test it and expect to apply it in a few hours. Thanks for your contributions!

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 14, 2012

From @steve-m-hay

On Tue Aug 14 01​:15​:45 2012, Steve.Hay@​verosoftware.com wrote​:

bulk 88 via RT wrote on 2012-08-14​:

Can someone apply the patch to
perl git?

---
via perlbug​: queue​: perl5 status​: open

https://rt-archive.perl.org/perl5/Ticket/Display.html?id=88840

Apologies
for the delay. I will test it and expect to apply it in a few hours.
Thanks for your contributions!

Thanks again, now committed in d903973.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 14, 2012

@steve-m-hay - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.