Daemon segfault after a few graceful reloads with multiple threads #81

Closed
hellsworth opened this issue Jul 10, 2015 · 14 comments

@hellsworth

We are running on our own Debian Jessie derivative with Apache 2.4.10 and mod_wsgi 4.4.13 and experience a segfault after multiple reloads of Apache. The segfault also seems to depend on the amount of time between reloads. For example, if we run for i in `seq 1 10`; do echo $i; sudo service apache2 reload; sleep 2; done, it fails on the 7th reload every time. If we increase the sleep between reloads to sleep 3, all 10 reloads succeed every time.
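
For reference, here is a rough Python sketch of that timing experiment (a sketch only; it assumes the same service command as above, and the helper function is hypothetical, not something from our deployment):

# Hypothetical repro helper: sweep the delay between graceful reloads to
# find the threshold at which the reload starts to fail.
import subprocess
import time

def reload_loop(delay, count=10):
    """Issue `count` graceful reloads, sleeping `delay` seconds in between.

    Returns the 1-based index of the first failing reload, or None if all
    reloads succeed. A non-zero exit status matches the
    "Job for apache2.service failed" error we see.
    """
    for i in range(1, count + 1):
        print("reload %d (delay %.1fs)" % (i, delay))
        if subprocess.call(["sudo", "service", "apache2", "reload"]) != 0:
            return i
        time.sleep(delay)
    return None

for delay in (1.0, 2.0, 3.0):
    failed_at = reload_loop(delay)
    if failed_at is None:
        print("delay %.1fs: all reloads succeeded" % delay)
    else:
        print("delay %.1fs: failed at reload %d" % (delay, failed_at))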

Our Jessie derivative is the base OS for openstack so we have several openstack processes running:
$ ps -ef | grep apache2
root 27505 33129 0 14:52 pts/1 00:00:00 sudo vi /var/log/apache2/error.log
root 27506 27505 0 14:52 pts/1 00:00:00 vi /var/log/apache2/error.log
root 31470 1 0 15:07 ? 00:00:00 /usr/sbin/apache2 -k start
ceilome+ 31474 31470 0 15:07 ? 00:00:39 /usr/sbin/apache2 -k start
ceilome+ 31475 31470 0 15:07 ? 00:00:40 /usr/sbin/apache2 -k start
ceilome+ 31476 31470 0 15:07 ? 00:00:40 /usr/sbin/apache2 -k start
ceilome+ 31477 31470 0 15:07 ? 00:00:39 /usr/sbin/apache2 -k start
horizon 31478 31470 1 15:07 ? 00:03:27 /usr/sbin/apache2 -k start
horizon 31479 31470 1 15:07 ? 00:03:28 /usr/sbin/apache2 -k start
horizon 31480 31470 1 15:07 ? 00:03:24 /usr/sbin/apache2 -k start
keystone 31481 31470 0 15:07 ? 00:01:53 /usr/sbin/apache2 -k start
keystone 31482 31470 0 15:07 ? 00:01:50 /usr/sbin/apache2 -k start
keystone 31483 31470 1 15:07 ? 00:02:16 /usr/sbin/apache2 -k start
keystone 31484 31470 1 15:07 ? 00:02:15 /usr/sbin/apache2 -k start
www-data 31485 31470 0 15:07 ? 00:00:37 /usr/sbin/apache2 -k start
www-data 31486 31470 0 15:07 ? 00:01:06 /usr/sbin/apache2 -k start
stack 32639 25024 0 18:38 pts/3 00:00:00 grep apache2

If I do a sudo service apache2 reload anywhere from 3-10 times, at some point it will fail. Note that this happens when there is less than 1 second between reloads:
$ sudo service apache2 reload
Job for apache2.service failed. See 'systemctl status apache2.service' and 'journalctl -xn' for details.
$ systemctl status apache2.service
● apache2.service - LSB: Apache2 web server
Loaded: loaded (/etc/init.d/apache2)
Active: active (running) (Result: exit-code) since Fri 2015-07-10 15:07:19 EDT; 5h 35min ago
Process: 31436 ExecStop=/etc/init.d/apache2 stop (code=exited, status=1/FAILURE)
Process: 13873 ExecReload=/etc/init.d/apache2 reload (code=exited, status=1/FAILURE)
Process: 31449 ExecStart=/etc/init.d/apache2 start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/apache2.service
└─13413 /usr/sbin/apache2 -k start
$ journalctl -xn
No journal files were found.

Analyzing the core dump file with gdb yields:
$ sudo gdb /usr/sbin/apache2 /tmp/mycoredump/core
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/apache2...done.
[New LWP 31470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/apache2 -k start'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f1526f92ad6 in ?? ()

In the /var/log/apache2/error.log, each successful reload causes this sequence:

  • SIGUSR1 received
  • services are deregistered
  • destroy/cleanup interpreters/terminate python for each pid
  • attach interpreter

For the situation where we use a sleep 2 (or less) in between reloads, we see in the error log:

  • SIGUSR1 received
  • services are deregistered
  • destroy/cleanup interpreters/terminate python for each pid
  • [core:notice] [pid 48512] AH00060: seg fault or similar nasty error detected in the parent process

The "attach interpreter" entry never appears.

Then apache2 has to be restarted (sudo service apache2 restart) before we can do any further reloads. Also, searching around online, I found https://bugs.launchpad.net/ubuntu/+source/redland-bindings/+bug/1416875, which may be related.

Looking at one of the virtual host conf files, it's running mod_wsgi in daemon mode and everything looks fine to me:

Listen 127.0.0.1:8777
WSGIPythonHome /opt/stack/service/ceilometer-common/venv/bin/../
<VirtualHost *:8777>
    WSGIScriptAlias / /opt/stack/service/ceilometer-common/venv/bin/../lib/python2.7/site-packages/ceilometer/api/app.wsgi
    WSGIDaemonProcess ceilometer user=ceilometer group=ceilometer processes=4 threads=5 python-path=/opt/stack/service/ceilometer-common/venv/bin/../lib/python2.7/site-packages
    WSGIApplicationGroup ceilometer
    WSGIProcessGroup ceilometer

    ErrorLog /var/log/apache2/ceilometer_modwsgi.log
    LogLevel info
    CustomLog /var/log/apache2/ceilometer_access.log combined

    <Directory /opt/stack/service/ceilometer-common/venv/bin/../lib/python2.7/site-packages/ceilometer>
        Options Indexes FollowSymLinks MultiViews
        Require all granted
        AllowOverride None
        Order allow,deny
        allow from all
        LimitRequestBody 102400
    </Directory>
</VirtualHost>

Do you have any ideas of what could be causing the reloads to fail? Any advice or thoughts would be very useful.

Thank you for looking at this issue,
Heather Brown
hbrown@hp.com

@GrahamDumpleton
Owner

If ceilometer is the only WSGI application running in that specific mod_wsgi daemon process group within that VirtualHost, instead of using:

WSGIApplicationGroup ceilometer
WSGIProcessGroup ceilometer

try:

WSGIApplicationGroup %{GLOBAL}
WSGIProcessGroup ceilometer

Some third party Python packages don’t always work properly in Python sub interpreters and instead need to run in the main (first) Python interpreter. If this isn’t done, one of the symptoms can be process crashes. Using ‘%{GLOBAL}’ will force the use of the main interpreter context within the process.

So as a first step, that would be the best thing to try to see if it helps. If not, then we can look in more detail at the other information you have collected.

You can find a bit more information on this issue with sub interpreters mentioned at:

https://code.google.com/p/modwsgi/wiki/ApplicationIssues#Python_Simplified_GIL_State_API

Graham

@hellsworth
Author

Ok, I changed WSGIApplicationGroup from ceilometer to %{GLOBAL} in the ceilometer virtual host. (For the record, all of the other virtual host files already had this %{GLOBAL} setting.) After changing the virtual host file, I restarted Apache and tried my "for i in `seq 1 10`; do echo $i; sudo service apache2 reload; sleep 2; done" command again. It failed on the 7th issuance of the reload command again.

@hellsworth
Author

Well I'm at a loss here. I can't even tell if this issue is a problem with mod-wsgi or apache2. I found the apache code that generates the error "seg fault or similar nasty error detected in the parent process".

In the apache (v2.4.12-2) source, httpd-2.4.12/server/mpm_unix.c:
/* handle all varieties of core dumping signals */
static void sig_coredump(int sig)
{
    apr_filepath_set(ap_coredump_dir, pconf);
    apr_signal(sig, SIG_DFL);
#if AP_ENABLE_EXCEPTION_HOOK
    run_fatal_exception_hook(sig);
#endif
    /* linuxthreads issue calling getpid() here:
     *   This comparison won't match if the crashing thread is
     *   some module's thread that runs in the parent process.
     *   The fallout, which is limited to linuxthreads:
     *   The special log message won't be written when such a
     *   thread in the parent causes the parent to crash.
     */
    if (getpid() == parent_pid) {
        ap_log_error(APLOG_MARK, APLOG_NOTICE,
                     0, ap_server_conf, APLOGNO(00060)
                     "seg fault or similar nasty error detected "
                     "in the parent process");
        /*
         * XXX we can probably add some rudimentary cleanup code here,
         * like getting rid of the pid file.  If any additional bad stuff
         * happens, we are protected from recursive errors taking down the
         * system since this function is no longer the signal handler GLA
         */
    }
    kill(getpid(), sig);
    /* At this point we've got sig blocked, because we're still inside
     * the signal handler.  When we leave the signal handler it will
     * be unblocked, and we'll take the signal... and coredump or whatever
     * is appropriate for this particular Unix.  In addition the parent
     * will see the real signal we received -- whereas if we called
     * abort() here, the parent would only see SIGABRT.
     */
}

My attention is drawn to: "This comparison won't match if the crashing thread is some module's thread that runs in the parent process."

$ ps -p 11120 -lfT
F S UID PID SPID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
5 S ceilome+ 11120 11120 11117 0 80 0 - 152495 poll_s 19:45 ? 00:00:00 /usr/sbin/apache2 -k start
1 S ceilome+ 11120 11139 11117 0 80 0 - 152495 poll_s 19:45 ? 00:00:00 /usr/sbin/apache2 -k start
1 S ceilome+ 11120 11140 11117 0 80 0 - 152495 poll_s 19:45 ? 00:00:00 /usr/sbin/apache2 -k start
1 S ceilome+ 11120 11141 11117 0 80 0 - 152495 futex_ 19:45 ? 00:00:02 /usr/sbin/apache2 -k start
1 S ceilome+ 11120 11142 11117 0 80 0 - 152495 futex_ 19:45 ? 00:00:00 /usr/sbin/apache2 -k start
1 S ceilome+ 11120 11143 11117 0 80 0 - 152495 futex_ 19:45 ? 00:00:00 /usr/sbin/apache2 -k start
1 S ceilome+ 11120 11144 11117 0 80 0 - 152495 futex_ 19:45 ? 00:00:00 /usr/sbin/apache2 -k start
1 S ceilome+ 11120 11145 11117 0 80 0 - 152495 fcntl_ 19:45 ? 00:00:00 /usr/sbin/apache2 -k start

So we have 8 threads of ceilometer running in process 11120, and all under the parent process 11117. For the record, the process 11117 is the apache process itself:
$ ps -ef | grep apache2
root 3691 1 0 Jul16 ? 00:00:00 sudo tail -f /var/log/apache2/error.log
root 3692 3691 0 Jul16 ? 00:00:00 tail -f /var/log/apache2/error.log
root 8126 1 0 Jul14 ? 00:00:00 sudo tail -f /var/log/apache2/error.log
root 8127 8126 0 Jul14 ? 00:00:00 tail -f /var/log/apache2/error.log
root 8519 1 0 Jul16 ? 00:00:00 sudo tail -f /var/log/apache2/error.log
root 8520 8519 0 Jul16 ? 00:00:00 tail -f /var/log/apache2/error.log
root 9305 1 0 Jul17 ? 00:00:00 sudo tail -f /var/log/apache2/error.log
root 9306 9305 0 Jul17 ? 00:00:00 tail -f /var/log/apache2/error.log
root 9486 1 0 Jul15 ? 00:00:00 sudo tail -f /var/log/apache2/error.log
root 9487 9486 0 Jul15 ? 00:00:00 tail -f /var/log/apache2/error.log
root 11117 1 0 19:45 ? 00:00:00 /usr/sbin/apache2 -k start
ceilome+ 11120 11117 0 19:45 ? 00:00:05 /usr/sbin/apache2 -k start
ceilome+ 11121 11117 0 19:45 ? 00:00:05 /usr/sbin/apache2 -k start
ceilome+ 11122 11117 0 19:45 ? 00:00:05 /usr/sbin/apache2 -k start
ceilome+ 11123 11117 0 19:45 ? 00:00:05 /usr/sbin/apache2 -k start
keystone 11124 11117 0 19:45 ? 00:00:04 /usr/sbin/apache2 -k start
keystone 11125 11117 0 19:45 ? 00:00:04 /usr/sbin/apache2 -k start
keystone 11126 11117 0 19:45 ? 00:00:07 /usr/sbin/apache2 -k start
keystone 11127 11117 0 19:45 ? 00:00:07 /usr/sbin/apache2 -k start
www-data 11128 11117 0 19:45 ? 00:00:04 /usr/sbin/apache2 -k start
www-data 11129 11117 0 19:45 ? 00:00:05 /usr/sbin/apache2 -k start
stack 19565 1026 0 20:16 pts/8 00:00:00 grep apache2
root 21430 1 0 Jul16 ? 00:00:00 sudo tail -f /var/log/apache2/error.log
root 21431 21430 0 Jul16 ? 00:00:00 tail -f /var/log/apache2/error.log
root 21941 1 0 Jul16 ? 00:00:00 sudo tail -f /var/log/apache2/error.log
root 21942 21941 0 Jul16 ? 00:00:00 tail -f /var/log/apache2/error.log

I think that all of the ceilometer and keystone processes are WSGI daemon processes (because each of their virtual host files has WSGIDaemonProcess, WSGIApplicationGroup, and WSGIProcessGroup directives). Is this correct?

If processes 11120-11127 are WSGI (module) processes, then none of their threads has an SPID that matches the apache2 parent process, 11117.

Do you have any suggestions?

@GrahamDumpleton
Owner

Sorry for the delay on this one. I was in the process of starting a new job, so I have been busy with that.

Do you have full stack traces from a core dump? I can't see that you provided any.

Once in gdb, try running:

thread apply all bt

This should give full stack traces for all active threads in the process at the time of the core dump. Usually I use it when attached to a live process, so hopefully it works for a core dump as well.

Given that the issue relates to receiving signals while things were possibly still starting up, I can imagine what the issue may be, as I have had problems with that in the past, but I believed they were fixed. I do still know of some issues related to signal handling when preloading Python code into a process, but I can't see right now how that would be related.

@hellsworth
Author

Here is the full stack trace:

$ sudo gdb /usr/sbin/apache2 /tmp/mycoredump/core
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/apache2...done.
[New LWP 33130]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/apache2 -k start'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f5083270ad6 in ?? ()
(gdb) bt full
#0 0x00007f5083270ad6 in ?? ()
No symbol table info available.
#1  <signal handler called>
No locals.
#2 0x00007f508603c293 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
No locals.
#3 0x00007f508654b085 in apr_sleep () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
No symbol table info available.
#4 0x00007f508653ed11 in ?? () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
No symbol table info available.
#5 0x00007f508653fca0 in apr_pool_clear () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
No symbol table info available.
#6 0x00007f5086e4a1c1 in main (argc=3, argv=0x7ffc6f60fed8) at main.c:707
c = 0 '\000'
showcompile = 0
showdirectives = 0
confname = 0x7f5086ea8f23 "apache2.conf"
def_server_root = 0x7f5086ea8f30 "/etc/apache2"
temp_error_log = 0x0
error = 0x0
process = 0x7f5086e07118
pconf = 0x7f5086e05028
plog = 0x7f5086dd3028
ptemp = 0x7f5086daf028
pcommands = 0x7f5086ddd028
opt = 0x7f5086ddd118
rv = 0
mod = 0x7f50870c7080 <ap_prelinked_modules+64>
opt_arg = 0x7f5086e10060 <_rtld_global> "\250\021\341\206P\177"
signal_server = 0x7f5086e898e5 <ap_signal_server>
(gdb) thread apply all bt

Thread 1 (Thread 0x7f5086dff780 (LWP 33130)):
#0 0x00007f5083270ad6 in ?? ()
#1  <signal handler called>
#2 0x00007f508603c293 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#3 0x00007f508654b085 in apr_sleep () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
#4 0x00007f508653ed11 in ?? () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
#5 0x00007f508653fca0 in apr_pool_clear () from /usr/lib/x86_64-linux-gnu/libapr-1.so.0
#6 0x00007f5086e4a1c1 in main (argc=3, argv=0x7ffc6f60fed8) at main.c:707

@bretonium

I ran into the same issue too. Segfaults happen on logrotate.

@aglarendil

The same is true for me - ran into this issue several times.

@GrahamDumpleton
Owner

Try 4.4.15 and see if the issue goes away. That release includes a fix for core dumps when doing graceful restarts where mod_wsgi was being loaded for the first time. The fix was for a permanent failure and not a transient one though, so this could well be different.

@bkupidura

The same for me. 4.4.15 didn't fix the issue.
Using WSGIApplicationGroup %{GLOBAL} reduces how often the issue occurs.

@GrahamDumpleton
Owner

GrahamDumpleton commented May 11, 2016

@bkupidura Are you using embedded mode or daemon mode of mod_wsgi?

If using daemon mode, are you also setting:

WSGIRestrictEmbedded On

Use of WSGIApplicationGroup with the value %{GLOBAL}, although recommended anyway if your other configuration allows it, helps to combat problems with third party C extension modules for Python which aren't written to work properly with Python sub interpreters. If WSGIApplicationGroup is truly helping, then that would point to the issue potentially relating to some Python package you are using in your application.
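
If it helps to confirm which interpreter context requests actually end up in, a quick check can be dropped into a test WSGI script. This is only a sketch, and the mod_wsgi.* environ keys are an assumption based on the values mod_wsgi normally adds to the request environ:

# Sketch only: report which mod_wsgi process group and application group
# handled a request, to confirm whether the main interpreter
# (WSGIApplicationGroup %{GLOBAL}, reported as an empty application group)
# is actually in effect. The 'mod_wsgi.*' environ keys are an assumption.
def application(environ, start_response):
    process_group = environ.get('mod_wsgi.process_group', '(embedded)')
    application_group = environ.get('mod_wsgi.application_group', '(unknown)')
    body = ('process group: %r\napplication group: %r\n'
            % (process_group, application_group)).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]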

Also, are you perhaps using mod_pagespeed in the same Apache installation?

@bkupidura

Daemon mode.

Configuration:

<VirtualHost *:35357>
    ServerName SERVER_NAME

    ## Vhost docroot
    DocumentRoot "/usr/lib/cgi-bin/keystone"

    ## Directories, there should at least be a declaration for /usr/lib/cgi-bin/keystone
    <Directory "/usr/lib/cgi-bin/keystone">
        Options Indexes FollowSymLinks MultiViews
        AllowOverride None
        Require all granted
    </Directory>

    ## Logging
    ErrorLog "/var/log/apache2/keystone_wsgi_admin_error.log"
    ServerSignature Off
    CustomLog "/var/log/apache2/keystone_wsgi_admin_access.log" "%h %l %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\""

    WSGIDaemonProcess keystone_admin display-name=keystone-admin group=keystone processes=2 threads=3 user=keystone
    WSGIApplicationGroup %{GLOBAL}
    WSGIProcessGroup keystone_admin
    WSGIScriptAlias / "/usr/lib/cgi-bin/keystone/admin"

    ## Custom fragment
    LimitRequestFieldSize 81900
</VirtualHost>

Additing "WSGIRestrictedEmbedded" to mods-enabled/wsgi.conf didnt change anything. Issue still there.

@GrahamDumpleton
Owner

Ahhhh. The KeyStone application from OpenStack. I have had separate reports via Red Hat of Apache process crashes on process shutdown when it specifically is being hosted. I am not sure yet what it is about that specific application which is causing problems. I am not seeing reports like that for other applications. One possible cause for problems being investigated is whether KeyStone is creating background threads. These can be a problem because they will not be getting stopped prior to destroying the Python interpreter and so if the thread gets woken up again during shutdown, it can then access invalid memory. It is not an easy problem to solve completely. Any application creating background threads should really be registering an atexit callback that attempts to shutdown the threads so they aren't running when the interpreter is destroyed.
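
As an illustration only (not Keystone's actual code), that atexit pattern for a background worker thread might look something like this:

# Sketch of the atexit pattern described above: a background worker thread
# is signalled to stop before the interpreter is destroyed, so it is not
# woken up during shutdown after the interpreter state has gone away.
import atexit
import threading

_shutdown = threading.Event()

def _worker():
    while not _shutdown.is_set():
        # ... periodic background work would go here ...
        _shutdown.wait(1.0)

_thread = threading.Thread(target=_worker)
_thread.daemon = True
_thread.start()

def _stop_background_thread():
    # Signal the worker and wait for it to exit before interpreter teardown.
    _shutdown.set()
    _thread.join(timeout=5.0)

atexit.register(_stop_background_thread)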

@GrahamDumpleton
Owner

See also the stack traces in #132.

@GrahamDumpleton
Owner

I am going to close this finally. There have been a few fixes since this was originally reported which address problems in mod_wsgi related to memory usage. I have also not seen any further reports related to Keystone on Apache for a long time now, so I am assuming it is either not being run on Apache anymore or the problems have resolved themselves. Create a new issue if this is still a problem.
