Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FastCGI mapserv gets unexpected signal 11 when apache attempts to shutdown the process #5254

Open
akrherz opened this issue Feb 23, 2016 · 15 comments

Comments

@akrherz
Copy link
Contributor

akrherz commented Feb 23, 2016

Using a recent mapserv with FastCGI (v2.4.0) support, I find the following logged when Apache attempts to shutdown the process when it has reached its maximum request lifetime (ie FcgidMaxRequestsPerProcess 10000).

Feb 23 10:07:39 xxx httpd[5156]: [fcgid:error] [pid 5156] mod_fcgid: process /xxx/mapserv.fcgi(24131) exit(lifetime expired), get unexpected signal 11

I assume this means an 'unclean' shutdown. I don't see any bad impacts from this, but it would be nice to resolve to keep the logs clean :) I am at a loss on how to properly debug this and provide you with some more to go on. I can easily reproduce this on my development system, so I'd be happy to do whatever you need to provide more detail!

PS, here's my question on the forums email list a few years back.

@tbonfort
Copy link
Member

I tried to quickly reproduce this by setting FcgidMaxRequestsPerProcess to 2, but was not able to get that error message. Try compiling a debug version of the mapserv binary and configuring apache to dump a core file so we can at least have a backtrace

@akrherz
Copy link
Contributor Author

akrherz commented Feb 24, 2016

Hi @tbonfort , thanks so much for your continued help. So I attempted to make a debug build cmake .. -DCMAKE_BUILD_TYPE=Debug and I can't reproduce with that build :( Apache's error log simply fills with

In msCleanupOnSignal.
In msCleanupOnSignal.
In msCleanupOnSignal.
In msCleanupOnSignal.

which is expected. Without the debug build, I am also struggling to get apache to dump cores. Maybe the signal 11 is some other issue? I'll keep after it and try to give you something to work off of. Thanks for your patience.

@akrherz
Copy link
Contributor Author

akrherz commented Feb 24, 2016

A miracle occurred and it dropped a core :)

# gdb mapserv.fcgi core.10723 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/akrherz/projects/iem/cgi-bin/mapserv/mapserv.fcgi...done.
[New LWP 10723]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/mesonet/www/apps/iemwebsite/cgi-bin/mapserv/mapserv.fcgi'.
Program terminated with signal 11, Segmentation fault.
#0  __GI_getenv (name=0x7fce5a2f4645 "L_DEBUG") at getenv.c:89
89        if (name_start == ep_start && !strncmp (*ep + 2, name, len)
(gdb) bt full
#0  __GI_getenv (name=0x7fce5a2f4645 "L_DEBUG") at getenv.c:89
        ep_start = <error reading variable ep_start (Cannot access memory at address 0x4804370)>
        len = 7
        ep = 0x20f2fc8
        name_start = 20547
#1  0x00007fce59f08536 in CPLGetConfigOption (pszKey=0x7fce5a2f4643 "CPL_DEBUG", pszDefault=0x0) at cpl_conv.cpp:1569
        pszResult = 0x0
        papszTLConfigOptions = <optimized out>
#2  0x00007fce59f0b0c2 in CPLDebug (pszCategory=0x7fce5a2cf513 "GDAL", pszFormat=0x7fce5a2f607d "GDALClose(%s, this=%p)") at cpl_error.cpp:383
        psCtx = 0x20f1ae0
        pszMessage = <optimized out>
        args = {{gp_offset = 290455933, fp_offset = 0, overflow_arg_area = 0x7fce5a6d4c80, reg_save_area = 0x0}}
        pszDebug = <optimized out>
#3  0x00007fce59eda1b2 in GDALDataset::~GDALDataset (this=0x20f3db0, __in_chrg=<optimized out>) at gdaldataset.cpp:253
        i = <optimized out>
#4  0x00007fce59ce9495 in GTiffDataset::~GTiffDataset (this=0x20f3db0, __in_chrg=<optimized out>) at geotiff.cpp:3575
No locals.
#5  0x00007fce59ce9739 in GTiffDataset::~GTiffDataset (this=0x20f3db0, __in_chrg=<optimized out>) at geotiff.cpp:3579
No locals.
#6  0x00007fce59edb31c in GDALClose (hDS=0x20f3db0) at gdaldataset.cpp:2508
        poDS = 0x20f3db0
        oHolder = {hMutex = 0x20f14d0, pszFile = 0x7fce5a2f6094 "gdaldataset.cpp", nLine = 2488}
        oLocaleForcer = {pszOldLocale = 0x0}
#7  0x00007fce5c320622 in msGDALCleanup () at /home/akrherz/BUILD/mapserver/mapgdal.c:96
        i = 0
        pahDSList = 0x2091600
        nDSCount = 1
        bDidSomething = 0
        iRepeat = 5
#8  0x00007fce5c2a2d52 in msCleanup () at /home/akrherz/BUILD/mapserver/maputil.c:1944
No locals.
#9  0x000000000040126d in msCleanupOnSignal (nInData=15) at /home/akrherz/BUILD/mapserver/mapserv.c:56
No locals.
#10 <signal handler called>
No locals.
#11 0x00007fce58b6cf1e in __accept_nocancel () at ../sysdeps/unix/syscall-template.S:81
No locals.
#12 0x00007fce5bf479ea in OS_Accept (listen_sock=0, fail_on_intr=0, webServerAddrs=0x0) at os_unix.c:1175
        len = 112
        socket = -1
        sa = {un = {sun_family = 1537, 
            sun_path = "\000\001\000\000\000\000\200\251\v\003", '\000' <repeats 12 times>, "`\264\v\003", '\000' <repeats 12 times>, "\200\251\v\003\000\000\000\000\001\003\000\001\000\b\000\000\000\000\000\000\000\000\000\000\360\255\215=\376\177\000\000\060\060\017\002\000\000\000\000\340\222\017\002\000\000\000\000\b\244\024\\\316\177\000\000\220\256\215=\376\177\000\000\000\206e\016\217D"}, in = {sin_family = 1537, sin_port = 256, sin_addr = {s_addr = 0}, 
            sin_zero = "\200\251\v\003\000\000\000"}}
#13 0x00007fce5bf459ac in FCGX_Accept_r (reqDataPtr=0x7fce5c14a3e0 <the_request>) at fcgiapp.c:2196
No locals.
#14 0x00007fce5bf45ae5 in FCGX_Accept (in=in@entry=0x7ffe3d8dae78, out=out@entry=0x7ffe3d8dae80, err=err@entry=0x7ffe3d8dae88, envp=envp@entry=0x7ffe3d8dae90)
---Type <return> to continue, or q <return> to quit---
    at fcgiapp.c:2141
        rc = <optimized out>
#15 0x00007fce5bf45c6a in FCGI_Accept () at fcgi_stdio.c:140
        in = 0x7fce5c2c05a7 <msResetErrorList+13>
        out = 0x30b7840
        error = 0x20912b0
        envp = 0x7fce5c68f3e0 <ms_error.10688>
        acceptResult = <optimized out>
#16 0x000000000040174e in main (argc=1, argv=0x7ffe3d8db008) at /home/akrherz/BUILD/mapserver/mapserv.c:239
        iArg = 1
        sendheaders = 1
        execstarttime = {tv_sec = 0, tv_usec = 0}
        execendtime = {tv_sec = 1, tv_usec = 4200493}
        requeststarttime = {tv_sec = 1, tv_usec = 140729931116432}
        requestendtime = {tv_sec = 140729931116552, tv_usec = 140524284956864}
        mapserv = 0x20912b0
(gdb) 

@tbonfort
Copy link
Member

/Maybe/ the fact we were calling printf inside the signal handler caused an issue. Can you try out the referenced commit and report back ?

@akrherz
Copy link
Contributor Author

akrherz commented Feb 25, 2016

My deleted comment spoke too soon, I reproduced the issue. Interestingly, it only happened on 4 exits out of ~410 total exits. Which is about the reproducability that I was seeing before. I feel like I am forgetting a bug in GDAL that I looked at a number of years back, I am getting old, uffties! This is GDAL 1.11.4, btw

@hschiebold
Copy link

@akrherz Hi, we have the same issue here. Could you solve the problem with the mapserv.c version or is it a bug in GDAL? We have GDAL 1.11.3 at the moment.

@akrherz
Copy link
Contributor Author

akrherz commented Mar 31, 2016

@hschiebold I have not found a solution to the problem yet.

@jratike80
Copy link

Has anybody tested if that commit fixed the issue?

@akrherz
Copy link
Contributor Author

akrherz commented Oct 21, 2019

@jratike80 It did not fix the issue for me.

@akrherz
Copy link
Contributor Author

akrherz commented Jan 17, 2023

Just to denote that I am still seeing this with mapserver 8.0.0 and to cross-reference #6385

@akrherz
Copy link
Contributor Author

akrherz commented Jan 17, 2023

To speculate what is happening, mod-fcgi is sending SIGTERM to the process, which calls msCleanupOnSignal to set finish_process=1, but this change does not do anything until the next fcgi request comes to the process, which likely will not happen since mod-fcgi has marked the process for killing. If I attempt to do this manually:

apache    102411  0.0  0.2 150728 35076 ?        S    13:47   0:00 /opt/iem/cgi-bin/mapserv/mapser

$ kill 102411
# The process is still there
apache    102411  0.1  0.2 150728 35076 ?        S    13:47   0:00 /opt/iem/cgi-bin/mapserv/mapserv

I send a new fcgi request to apache and boom

laptop root[2384]: [2023-01-17 14:03:22.652278] [fcgid:warn] [client mod_fcgid: cleanup zombie process 102411

Am I on the right track @rouault ?

@rouault
Copy link
Contributor

rouault commented Jan 17, 2023

Am I on the right track @rouault ?

Not sure, this is a bit black magic for me too. But I'm wondering if we shouldn't just remove any explicit signal handling in mapserver. Can you try to just remove the following lines in mapserv.c:

ifndef WIN32
  signal( SIGUSR1, msCleanupOnSignal );
  signal( SIGTERM, msCleanupOnSignal );
#endif

I see that libfcgi installs signal handlers on SIGUSR1 in https://github.com/BlueBrain/FastCGI/blob/master/libfcgi/os_unix.c. So it might be well possible that our overriding of it causes more damages than benefits.

@akrherz
Copy link
Contributor Author

akrherz commented Jan 17, 2023

Thanks @rouault , so what I tried

  1. Remove the two signal handlers per your suggestion
  2. Manually kill <pid> and observe that the process immediately dies and apache logs mod_fcgid: cleanup zombie process

Now I try

  1. setting FcgidMaxRequestsPerProcess 3 and restart apache
  2. send some requests and observe mod-fcgi killing appears to function properly (at least no errors are logged)

I'll run this change in my "production" cough and see if there's any troubles. Thanks

@akrherz
Copy link
Contributor Author

akrherz commented Jan 18, 2023

FWIW, the removal of the signal handlers appears to be working. I am unsure what to suggest though for a proper code change. Looking at the code, it seems like these handlers are only ensuring that stdout gets flushed / a current request being serviced gets finished before exiting?

@geographika
Copy link
Member

From the original commit: fbc84bc:

Implemented signal catching on unix (non-win32) platforms. The signal catcher calls msCleanup() which most importantly will close database connections open in the pool. This is most useful for the fastcgi case, but it could be good in a pure cgi case too where a long running cgi is killed for some reason.

So the primary reason is to close db connections (note it is not implemented on Windows, so I'm not sure if it crashes db connections remain open)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants