Ctrl-C doesn't kill off server in radiusd -X #1604

Open
alanbuxey opened this Issue May 6, 2016 · 37 comments


@alanbuxey
Member

When starting the server in debug mode:

radiusd -X

you can no longer Ctrl-C in the terminal to kill it - it just carries on running... you have to 'kill -9 %PID' in another window.

@mcnewton
Member
mcnewton commented May 6, 2016

What version? Fine here on v3.1.x HEAD.

@alanbuxey
Member

3.1.x HEAD - I can Ctrl-C while it's starting up... so there's some point when something becomes active, after which it's no longer capable of being killed. So it's some feature we use / have activated - it used to work, so it's either a recent change in the code or a recent change in our config (we recently moved to the redundant-load-balance pool, so possibly that?) - recent meaning since mid-April.

alan


@mcnewton
Member
mcnewton commented May 6, 2016

git bisect? :-)

@arr2036
Member
arr2036 commented May 6, 2016 edited

It will now always gracefully shut down, which means it calls the detach methods of the modules, closes open file descriptors, frees memory, etc. This is so that exit triggers work correctly.

If you run it under a debugger and send it a SIGINT you should be able to see where it's hanging. Probably in a module.

@arr2036
Member
arr2036 commented May 6, 2016 edited

Nah, no need to go hunting, I know what the general cause probably is.

-m was removed fairly recently. It's not an issue with removing -m and enabling its functionality by default, just that this has probably exposed a pre-existing issue.

@alanbuxey
Member

...go on... ;-)


@arr2036
Member
arr2036 commented May 7, 2016

That's about as specific as it gets :p If you can see where it's hanging we can probably fix it.

@alanbuxey
Member

detail (/dev/shm/detail): Polling for detail file
detail (/dev/shm/detail): Detail listener state unopened waiting 9.960882 sec

Program received signal SIGTERM, Terminated.
0x00007ffff5cfe9b3 in select () from /lib64/libc.so.6
(gdb)
Continuing.
Ready to process requests
detail (/dev/shm/detail): Polling for detail file
detail (/dev/shm/detail): Detail listener state unopened waiting 9.764499 sec

Program received signal SIGTERM, Terminated.
0x00007ffff5cfe9b3 in select () from /lib64/libc.so.6
(gdb)
Continuing.

Program received signal SIGTERM, Terminated.
0x00007ffff5cfe9b3 in select () from /lib64/libc.so.6
(gdb)
Continuing.
detail (/dev/shm/detail): Polling for detail file
detail (/dev/shm/detail): Detail listener state unopened waiting 10.040581 sec
Ready to process requests

detail (/dev/shm/detail): Polling for detail file
detail (/dev/shm/detail): Detail listener state unopened waiting 9.905166 sec

Program received signal SIGTERM, Terminated.
0x00007ffff5cfe9b3 in select () from /lib64/libc.so.6
(gdb)
Continuing.
Ready to process requests

Program received signal SIGTERM, Terminated.
0x00007ffff5cfe9b3 in select () from /lib64/libc.so.6
(gdb)
Continuing.

Program received signal SIGTERM, Terminated.
0x00007ffff5cfe9b3 in select () from /lib64/libc.so.6
(gdb)
Continuing.
Ready to process requests

detail (/dev/shm/detail): Polling for detail file
detail (/dev/shm/detail): Detail listener state unopened waiting 9.776397 sec

@arr2036
Member
arr2036 commented May 9, 2016

(gdb) signal sigint

@alanbuxey
Member

It's late and I'm tired - how do I get to type that in? The gdb window is filled with the radiusd -X output, and if I Ctrl-C in that window and type signal sigint I get told: No symbol "sigint" in current context.

@arr2036
Member
arr2036 commented May 9, 2016

Ah, maybe it doesn't do the name translation. signal 2 should work when FreeRADIUS is running in the debugger and the debugger is paused.

@alanbuxey
Member

(gdb) signal 2
Continuing with signal SIGINT.
Ready to process requests
detail (/dev/shm/detail): Polling for detail file
detail (/dev/shm/detail): Detail listener state unopened waiting 10.126864 sec

@arr2036
Member
arr2036 commented May 9, 2016

Hmm I guess it's that detail thread not being killed correctly.

@alanbuxey
Member
"Bump limit for exfiles to 256" (commit 46da504)?

:100644 100644 c312260... 95280f1... M src/modules/rlm_detail/rlm_detail.c
:100644 100644 302205f... 90bba4e... M src/modules/rlm_linelog/rlm_linelog.c
:100644 100644 4f0e115... ef02b63... M src/modules/rlm_sql/rlm_sql.c

@alandekok
Member

The exfiles patch is the one breaking SIGINT? I fail to see how that works...

And you can't really test SIGINT trapping with gdb, because gdb will trap the signal and return you to the gdb prompt.

@arr2036
Member
arr2036 commented May 9, 2016

It shouldn't, if you raise the signal from within gdb?

@alanbuxey
Member

Aha! Getting closer... I had 5 ldap connections defined:

ldap ldap1 {}
ldap ldap2 {}

etc. ... and was calling each one in some virtual servers... hence lots of connections, hence needing a higher ulimit for the radius/root user.

I've now reverted that 'best practice' a little - just talking to one ldap server for now... so my redundant-load-balance now only calls ldap1.

And now Ctrl-C works again!

More interesting... look at the bottom line that I now get:

Ready to process requests
^CReady to process requests
Signalled to terminate
Exiting normally
rlm_krb5 (krb5) - Removing connection pool
rlm_krb5 (krb5) - Closing connection (63)
rlm_krb5 (krb5) - Closing connection (62)
rlm_krb5 (krb5) - Closing connection (61)
rlm_krb5 (krb5) - Closing connection (60)
rlm_krb5 (krb5) - Closing connection (59)
rlm_krb5 (krb5) - Closing connection (58)
rlm_krb5 (krb5) - Closing connection (57)
rlm_krb5 (krb5) - Closing connection (56)
rlm_krb5 (krb5) - Closing connection (55)
rlm_krb5 (krb5) - Closing connection (54)
rlm_krb5 (krb5) - Closing connection (53)
rlm_krb5 (krb5) - Closing connection (52)
rlm_krb5 (krb5) - Closing connection (51)
rlm_krb5 (krb5) - Closing connection (50)
rlm_krb5 (krb5) - Closing connection (49)
rlm_krb5 (krb5) - Closing connection (48)
rlm_krb5 (krb5) - Closing connection (47)
rlm_krb5 (krb5) - Closing connection (46)
rlm_krb5 (krb5) - Closing connection (45)
rlm_krb5 (krb5) - Closing connection (44)
rlm_krb5 (krb5) - Closing connection (43)
rlm_krb5 (krb5) - Closing connection (42)
rlm_krb5 (krb5) - Closing connection (41)
rlm_krb5 (krb5) - Closing connection (40)
rlm_krb5 (krb5) - Closing connection (39)
rlm_krb5 (krb5) - Closing connection (38)
rlm_krb5 (krb5) - Closing connection (37)
rlm_krb5 (krb5) - Closing connection (36)
rlm_krb5 (krb5) - Closing connection (35)
rlm_krb5 (krb5) - Closing connection (34)
rlm_krb5 (krb5) - Closing connection (33)
rlm_krb5 (krb5) - Closing connection (32)
rlm_krb5 (krb5) - Closing connection (31)
rlm_krb5 (krb5) - Closing connection (30)
rlm_krb5 (krb5) - Closing connection (29)
rlm_krb5 (krb5) - Closing connection (28)
rlm_krb5 (krb5) - Closing connection (27)
rlm_krb5 (krb5) - Closing connection (26)
rlm_krb5 (krb5) - Closing connection (25)
rlm_krb5 (krb5) - Closing connection (24)
rlm_krb5 (krb5) - Closing connection (23)
rlm_krb5 (krb5) - Closing connection (22)
rlm_krb5 (krb5) - Closing connection (21)
rlm_krb5 (krb5) - Closing connection (20)
rlm_krb5 (krb5) - Closing connection (19)
rlm_krb5 (krb5) - Closing connection (18)
rlm_krb5 (krb5) - Closing connection (17)
rlm_krb5 (krb5) - Closing connection (16)
rlm_krb5 (krb5) - Closing connection (15)
rlm_krb5 (krb5) - Closing connection (14)
rlm_krb5 (krb5) - Closing connection (13)
rlm_krb5 (krb5) - Closing connection (12)
rlm_krb5 (krb5) - Closing connection (11)
rlm_krb5 (krb5) - Closing connection (10)
rlm_krb5 (krb5) - Closing connection (9)
rlm_krb5 (krb5) - Closing connection (8)
rlm_krb5 (krb5) - Closing connection (7)
rlm_krb5 (krb5) - Closing connection (6)
rlm_krb5 (krb5) - Closing connection (5)
rlm_krb5 (krb5) - Closing connection (4)
rlm_krb5 (krb5) - Closing connection (3)
rlm_krb5 (krb5) - Closing connection (2)
rlm_krb5 (krb5) - Closing connection (1)
rlm_krb5 (krb5) - Closing connection (0)
rlm_sql (sql) - Removing connection pool
rlm_sql (sql) - Closing connection (63)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (62)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (61)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (60)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (59)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (58)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (57)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (56)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (55)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (54)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (53)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (52)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (51)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (50)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (49)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (48)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (47)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (46)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (45)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (44)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (43)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (42)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (41)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (40)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (39)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (38)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (37)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (36)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (35)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (34)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (33)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (32)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (31)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (30)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (29)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (28)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (27)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (26)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (25)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (24)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (23)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (22)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (21)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (20)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (19)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (18)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (17)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (16)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (15)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (14)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (13)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (12)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (11)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (10)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (9)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (8)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (7)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (6)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (5)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (4)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (3)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (2)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (1)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_sql (sql) - Closing connection (0)
rlm_sql_postgresql - Socket destructor called, closing socket
rlm_ldap (ldap1) - Removing connection pool
rlm_ldap (ldap1) - Closing connection (63)
rlm_ldap (ldap1) - Closing connection (62)
rlm_ldap (ldap1) - Closing connection (61)
rlm_ldap (ldap1) - Closing connection (60)
rlm_ldap (ldap1) - Closing connection (59)
rlm_ldap (ldap1) - Closing connection (58)
rlm_ldap (ldap1) - Closing connection (57)
rlm_ldap (ldap1) - Closing connection (56)
rlm_ldap (ldap1) - Closing connection (55)
rlm_ldap (ldap1) - Closing connection (54)
rlm_ldap (ldap1) - Closing connection (53)
rlm_ldap (ldap1) - Closing connection (52)
rlm_ldap (ldap1) - Closing connection (51)
rlm_ldap (ldap1) - Closing connection (50)
rlm_ldap (ldap1) - Closing connection (49)
rlm_ldap (ldap1) - Closing connection (48)
rlm_ldap (ldap1) - Closing connection (47)
rlm_ldap (ldap1) - Closing connection (46)
rlm_ldap (ldap1) - Closing connection (45)
rlm_ldap (ldap1) - Closing connection (44)
rlm_ldap (ldap1) - Closing connection (43)
rlm_ldap (ldap1) - Closing connection (42)
rlm_ldap (ldap1) - Closing connection (41)
rlm_ldap (ldap1) - Closing connection (40)
rlm_ldap (ldap1) - Closing connection (39)
rlm_ldap (ldap1) - Closing connection (38)
rlm_ldap (ldap1) - Closing connection (37)
rlm_ldap (ldap1) - Closing connection (36)
rlm_ldap (ldap1) - Closing connection (35)
rlm_ldap (ldap1) - Closing connection (34)
rlm_ldap (ldap1) - Closing connection (33)
rlm_ldap (ldap1) - Closing connection (32)
rlm_ldap (ldap1) - Closing connection (31)
rlm_ldap (ldap1) - Closing connection (30)
rlm_ldap (ldap1) - Closing connection (29)
rlm_ldap (ldap1) - Closing connection (28)
rlm_ldap (ldap1) - Closing connection (27)
rlm_ldap (ldap1) - Closing connection (26)
rlm_ldap (ldap1) - Closing connection (25)
rlm_ldap (ldap1) - Closing connection (24)
rlm_ldap (ldap1) - Closing connection (23)
rlm_ldap (ldap1) - Closing connection (22)
rlm_ldap (ldap1) - Closing connection (21)
rlm_ldap (ldap1) - Closing connection (20)
rlm_ldap (ldap1) - Closing connection (19)
rlm_ldap (ldap1) - Closing connection (18)
rlm_ldap (ldap1) - Closing connection (17)
rlm_ldap (ldap1) - Closing connection (16)
rlm_ldap (ldap1) - Closing connection (15)
rlm_ldap (ldap1) - Closing connection (14)
rlm_ldap (ldap1) - Closing connection (13)
rlm_ldap (ldap1) - Closing connection (12)
rlm_ldap (ldap1) - Closing connection (11)
rlm_ldap (ldap1) - Closing connection (10)
rlm_ldap (ldap1) - Closing connection (9)
rlm_ldap (ldap1) - Closing connection (8)
rlm_ldap (ldap1) - Closing connection (7)
rlm_ldap (ldap1) - Closing connection (6)
rlm_ldap (ldap1) - Closing connection (5)
rlm_ldap (ldap1) - Closing connection (4)
rlm_ldap (ldap1) - Closing connection (3)
rlm_ldap (ldap1) - Closing connection (2)
rlm_ldap (ldap1) - Closing connection (1)
rlm_ldap (ldap1) - Closing connection (0)
rlm_mschap (mschap) - Removing connection pool
rlm_mschap (mschap) - Closing connection (63)
rlm_mschap (mschap) - Closing connection (62)
rlm_mschap (mschap) - Closing connection (61)
rlm_mschap (mschap) - Closing connection (60)
rlm_mschap (mschap) - Closing connection (59)
rlm_mschap (mschap) - Closing connection (58)
rlm_mschap (mschap) - Closing connection (57)
rlm_mschap (mschap) - Closing connection (56)
rlm_mschap (mschap) - Closing connection (55)
rlm_mschap (mschap) - Closing connection (54)
rlm_mschap (mschap) - Closing connection (53)
rlm_mschap (mschap) - Closing connection (52)
rlm_mschap (mschap) - Closing connection (51)
rlm_mschap (mschap) - Closing connection (50)
rlm_mschap (mschap) - Closing connection (49)
rlm_mschap (mschap) - Closing connection (48)
rlm_mschap (mschap) - Closing connection (47)
rlm_mschap (mschap) - Closing connection (46)
rlm_mschap (mschap) - Closing connection (45)
rlm_mschap (mschap) - Closing connection (44)
rlm_mschap (mschap) - Closing connection (43)
rlm_mschap (mschap) - Closing connection (42)
rlm_mschap (mschap) - Closing connection (41)
rlm_mschap (mschap) - Closing connection (40)
rlm_mschap (mschap) - Closing connection (39)
rlm_mschap (mschap) - Closing connection (38)
rlm_mschap (mschap) - Closing connection (37)
rlm_mschap (mschap) - Closing connection (36)
rlm_mschap (mschap) - Closing connection (35)
rlm_mschap (mschap) - Closing connection (34)
rlm_mschap (mschap) - Closing connection (33)
rlm_mschap (mschap) - Closing connection (32)
rlm_mschap (mschap) - Closing connection (31)
rlm_mschap (mschap) - Closing connection (30)
rlm_mschap (mschap) - Closing connection (29)
rlm_mschap (mschap) - Closing connection (28)
rlm_mschap (mschap) - Closing connection (27)
rlm_mschap (mschap) - Closing connection (26)
rlm_mschap (mschap) - Closing connection (25)
rlm_mschap (mschap) - Closing connection (24)
rlm_mschap (mschap) - Closing connection (23)
rlm_mschap (mschap) - Closing connection (22)
rlm_mschap (mschap) - Closing connection (21)
rlm_mschap (mschap) - Closing connection (20)
rlm_mschap (mschap) - Closing connection (19)
rlm_mschap (mschap) - Closing connection (18)
rlm_mschap (mschap) - Closing connection (17)
rlm_mschap (mschap) - Closing connection (16)
rlm_mschap (mschap) - Closing connection (15)
rlm_mschap (mschap) - Closing connection (14)
rlm_mschap (mschap) - Closing connection (13)
rlm_mschap (mschap) - Closing connection (12)
rlm_mschap (mschap) - Closing connection (11)
rlm_mschap (mschap) - Closing connection (10)
rlm_mschap (mschap) - Closing connection (9)
rlm_mschap (mschap) - Closing connection (8)
rlm_mschap (mschap) - Closing connection (7)
rlm_mschap (mschap) - Closing connection (6)
rlm_mschap (mschap) - Closing connection (5)
rlm_mschap (mschap) - Closing connection (4)
rlm_mschap (mschap) - Closing connection (3)
rlm_mschap (mschap) - Closing connection (2)
rlm_mschap (mschap) - Closing connection (1)
rlm_mschap (mschap) - Closing connection (0)
detail (/dev/shm/detail): Reader thread exited without informing the master: Bad file descriptor

alan

@alandekok
Member

The last line just means that we're missing some synchronization on exit, which is a low priority to fix. It doesn't cause any problems during normal operation because that code path is only used on exit.

I also don't see how opening more file descriptors causes CTRL-C to stop working. I'll put it down to a platform-specific bug TBH.

@alanbuxey
Member

Ctrl-C is now failing again on 3.1.x (it was working until last week).

@alandekok
Member

Works for me on OSX and on Ubuntu 15.10.

Can you try with a default configuration, see if it works, and then try to isolate the problem from there?

@mcnewton
Member

Nothing much has changed in the 3.1.x code since last week apart from reverting a chunk of code that's better in v4. TBH I'd suspect it's some race type condition triggered by the number of instances of rlm_ldap or similar that you're running, and because -m is now default so FR tries to clean up nicely.

Although the cleanest thing is to try and fix the bug, one option is just to restore previous non-"-m" behaviour for 3.1.x, and try and fix it up so everything exits cleanly in v4. If the latest commit broke it again, then it should be fixed in v4 anyway.

@arr2036
Member
arr2036 commented Jun 27, 2016

Restoring non '-m' means the exit triggers don't work correctly.

@alandekok
Member

I'd prefer to figure out why it works for most systems, but not for Buxey's. The fix should be simple once we know what it is.

@mcnewton
Member

I've taken a standard config + four or so different instantiations of ldap each with 64 pool servers, called in redundant load-balance, which is about as similar as I can work out from the info provided so far. And can't reproduce diddly squat. @alanbuxey as Alan D said can you chop your config down to the minimal amount that triggers this? Agree that's probably going to be the only way to find it.

@mcnewton
Member

Forget that... reproduced it :-)

@mcnewton
Member

If there is an ldap search underway and you hit Ctrl-C, you'll have to hit it repeatedly - thread pool.max_servers + 1 times - before you exit, unless the ldap search responds first.

So if your ldap search gets stuck and you have a crazy number of servers in the pool, then you'll have to hit Ctrl-C a crazy number of times before the server stops; i.e. each time you hit Ctrl-C, the current ldap search stops and it moves on to the next server in the pool.

...
(1)      files (noop)
(1)      ldap - Reserved connection (0)
(1)      ldap - EXPAND (uid=%{%{Stripped-User-Name}:-%{User-Name}})
(1)      ldap - --> (uid=bob)
(1)      ldap - Performing search in "dc=example,dc=com" with filter "(uid=bob)", scope "sub"
(1)      ldap - Waiting for search result...
^C(1)      ldap - Deleting inviable connection (0)
(1)      ldap - Need 6 more connections to reach 10 spares
(1)      ldap - Opening additional connection (5), 1 of 28 pending slots used
rlm_ldap (ldap) - Connecting to ldaps://server.example.com:636
rlm_ldap (ldap) - Waiting for bind result...
rlm_ldap (ldap) - Bind successful
(1)      ldap - Reserved connection (1)
(1)      ldap - WARNING: Search failed: Can't contact LDAP server. Got new socket, retrying...
(1)      ldap - Waiting for search result...
^C(1)      ldap - Deleting inviable connection (1)
(1)      ldap - Need 6 more connections to reach 10 spares
(1)      ldap - Opening additional connection (6), 1 of 28 pending slots used
rlm_ldap (ldap) - Connecting to ldaps://server.example.com:636
rlm_ldap (ldap) - Waiting for bind result...
rlm_ldap (ldap) - Bind successful
(1)      ldap - Reserved connection (2)
(1)      ldap - WARNING: Search failed: Can't contact LDAP server. Got new socket, retrying...
(1)      ldap - Waiting for search result...
^C(1)      ldap - Deleting inviable connection (2)
(1)      ldap - Reserved connection (3)
(1)      ldap - WARNING: Search failed: Can't contact LDAP server. Got new socket, retrying...
(1)      ldap - Waiting for search result...
^C(1)      ldap - Deleting inviable connection (3)
(1)      ldap - Reserved connection (4)
(1)      ldap - WARNING: Search failed: Can't contact LDAP server. Got new socket, retrying...
(1)      ldap - Waiting for search result...
^C(1)      ldap - Deleting inviable connection (4)
(1)      ldap - Reserved connection (5)
(1)      ldap - WARNING: Search failed: Can't contact LDAP server. Got new socket, retrying...
(1)      ldap - Waiting for search result...
^C(1)      ldap - Deleting inviable connection (5)
(1)      ldap - Need 2 more connections to reach 10 spares
(1)      ldap - Opening additional connection (7), 1 of 31 pending slots used
rlm_ldap (ldap) - Connecting to ldaps://server.example.com:636
rlm_ldap (ldap) - Waiting for bind result...
rlm_ldap (ldap) - Bind successful
(1)      ldap - Reserved connection (6)
(1)      ldap - WARNING: Search failed: Can't contact LDAP server. Got new socket, retrying...
(1)      ldap - ERROR: Hit reconnection limit
(1)      ldap - Released connection (6)
(1)      ldap (fail)
(1)    } # authorize (fail)
(1)  Using Post-Auth-Type Reject
(1)  Running Post-Auth-Type Reject from file /opt/fr31/etc/raddb/sites-enabled/default
(1)    Post-Auth-Type REJECT {
(1)      attr_filter.access_reject - EXPAND %{User-Name}
(1)      attr_filter.access_reject - --> bob
(1)      attr_filter.access_reject - Matched entry DEFAULT at line 11
(1)      attr_filter.access_reject (updated)
(1)      eap (noop)
(1)      remove_reply_message_if_eap {
(1)        if (&reply:EAP-Message && &reply:Reply-Message) {
(1)          ...
(1)        }
(1)        else {
(1)          noop (noop)
(1)        } # else (noop)
(1)      } # remove_reply_message_if_eap (noop)
(1)    } # Post-Auth-Type REJECT (updated)
(1)  Delaying response for 1.000000 seconds
Signalled to terminate
Exiting normally
rlm_ldap (ldap) - Removing connection pool
rlm_ldap (ldap) - Closing connection (7)
rlm_ldap (ldap) - Closing connection (6)

All default config, except that the mods-enabled/ldap symlink exists; the only config changed from the defaults:

ldap {  
    server = 'ldaps://server.example.com'
    identity = 'user@example.com'
    password = 'password'
    base_dn = 'dc=example,dc=com'
...
    tls {
        ...
        require_cert = 'never'
    }
...

What happens here is that the Ctrl-C interrupts ldap_search_ext in ldap.c:1050, which returns LDAP_PROC_RETRY; ldap.c:1071 then calls fr_connection_reconnect, which finds a new connection from the pool and returns it.

Essentially the main event loop has signalled itself to quit (radius_signal_self) but the thread stuck doing the ldap lookup doesn't see that - adding debug lines shows handle_signal_self doesn't get called until after the ldap module returns from checking all its available connections from its pool.

Should fr_connection_reconnect check to see if we've received SIGTERM, and return NULL in that case?

This is of course assuming I've hit the same issue...

@alandekok
Member

On Jun 27, 2016, at 5:30 PM, Matthew Newton notifications@github.com wrote:

> If there is an ldap search underway and you hit Ctrl-C, you'll have to hit it repeatedly thread pool.max_servers + 1 number of times before you exit, or if the ldap search responds.

Hmm... that's not good.

> So if your ldap search gets stuck and you have a crazy number of servers in the pool, then you'll have to hit Ctrl-C a crazy number of times before the server stops. i.e. each time you hit Ctrl-C the current ldap search stops and it moves on to the next server in the pool.

That's just rude. Is the OpenLDAP client library catching CTRL-C? If so, why?

> What happens here is the ctrl-c interrupts ldap_search_ext in ldap.c:1050, which returns LDAP_PROC_RETRY - ldap.c:1071 then calls fr_connection_reconnect, which finds a new connection from the pool and returns it.

It should also close the old connection.

Or even better, just return something like LDAP_PROC_INTR, which would mean that the caller keeps the same connection, but just retries.

> Essentially the main event loop has signalled itself to quit (radius_signal_self) but the thread stuck doing the ldap lookup doesn't see that - adding debug lines shows handle_signal_self doesn't get called until after the ldap module returns from checking all its available connections from its pool.

That's the problem with threads... any of that signalling is hard to get right.

> Should fr_connection_reconnect check to see if we've received SIGTERM, and return NULL in that case?

The thread pool code could do pthread_kill(..., SIGTERM) for all child threads when it's told to exit...

And yes, the connection pool should stop doing work when the server is exiting.

> This is of course assuming I've hit the same issue...

Sounds right.

Sounds right.

Alan DeKok.

@mcnewton
Member

Actually, libldap shows:

^Cldap_int_select returned -1: errno 4
ldap_err2string
Mon Jun 27 22:58:22 2016 : Info  : (1)      ldap - Deleting inviable connection (4)
Mon Jun 27 22:58:22 2016 : Debug : rlm_ldap (ldap) - Closing libldap handle 0x26b9810
ldap_free_request (origid 2, msgid 2)
ldap_free_connection 1 1
ldap_send_unbind
ldap_free_connection: actually freed

so hitting Ctrl-C is breaking the select(). Which I guess is fair enough.

So I think it's down to the connection pool realising that the server is on its way out, and not returning a new connection to use.

In the radiusd -X case, there is only one thread of course, so doing pthread_kill probably won't help here. Though this issue can be seen with both -X and -fxx -l stdout - in the latter case, the unused threads quit immediately, and the working thread reconnects as usual. In the end (which might take a long time, it doesn't seem to follow the "servers + 1" 'rule') the whole lot eventually bombs out with

ASSERT FAILED src/main/threads.c[786]: thread->status == THREAD_ACTIVE
CAUGHT SIGNAL: Aborted

But ultimately it's the same: if fr_connection_reconnect didn't give a new connection back to the module then it'd all finish a lot sooner.

Matthew

@alanbuxey
Member

Looks like it - as you say, many many many Ctrl-Cs (we have 5 ldap servers with 64 connections) and it does close... - obviously this affects systemd stop/start and server restarts/kills :/


@arr2036
Member
arr2036 commented Aug 21, 2016

This should be fixed in v4.0.x by the server re-architecture. We're planning to have one connection per thread.

@mcnewton
Member

That would be good. I hacked around with some ideas to fix this, but ran out of time. But in the single-threaded -X case, ISTR the signal handler still needs to do more than just record the fact that the server needs to quit - otherwise the connection pool just gets interrupted and then immediately reconnects with the next connection.

@alandekok
Member

The long-term solution is to differentiate EINTR due to CTRL-C from EINTR due to other sources.

We'll need to update every piece of code which checks for EINTR to also check for "server is supposed to shut down", and then do the right thing.

For the connection pool, it's relatively simple. Just refuse to continue after EINTR, and refuse to return new connections after SIGQUIT (via checking global variable, or some such thing).

For 4.0, every thread is responsible for its own connections, which can make this harder. But... each thread also has its own event loop, which means that it can check a global variable there, and then refuse to further service the event loop after SIGQUIT.

@alandekok alandekok added a commit that referenced this issue Sep 25, 2016
@alandekok alandekok Don't open new connections when exiting. Addresses #1604.
When we get a SIGTERM or SIGQUIT, mark "exiting", and stop
returning new connections.  Also, don't allow reconnection of
existing connections.  This should help with CTRL-C.
82f556b
@alandekok
Member

Please try git v3.0.x head. I've pushed a patch which should fix it. Test it soon, so we can get release 3.0.12 out.

@alanbuxey
Member

Is this patch in 3.1.x too? If not, is it trivial for me to cherry-pick it into 3.1.x?

alan


@alandekok
Member

You'll have to see if it goes to 3.1.x. I just did it for 3.0.


@mcnewton
Member

@alanbuxey FYI I just had a look at this; it doesn't apply cleanly on 3.1: it's fine for the main server, but the included tools don't define main_config, so it bombs out during linking. Presumably the same in v4.
