Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak rlm_python #2923

Closed
funzoneq opened this issue Aug 26, 2019 · 14 comments

Comments

@funzoneq
Copy link

commented Aug 26, 2019

Issue type

  • Questions about the server or its usage MUST be posted to the users mailing list. If you post those issues here, they will be closed and locked.
  • Remote security exploits MUST be sent to security@freeradius.org.

REMOVE THOSE WHICH DO NOT APPLY

  • Defect - Crash or memory corruption.
  • Defect - Unexpected behavior (obvious or verified by project member).

See here for debugging instructions and how to obtain backtraces.

Defect/Feature description

How to reproduce issue

CentOS 7 with FreeRADIUS 3.0.13 with rlm_python + eap configured.

Output of [radiusd|freeradius] -X showing issue occurring

(you may need to run [radiusd|freeradius] -fxx -l stdout if using eg RADIUS with TLS)

Tue Aug 20 17:37:31 2019 : Info: Ready to process requests
Tue Aug 20 17:38:40 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:38:40 2019 : Auth: (0) Login OK: [edgegw01-lab-scl.ps2:2000-255] (from client edgegws port 3997951)
Tue Aug 20 17:41:17 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:41:17 2019 : Auth: (2) Login OK: [edgegw01-lab-scl.ps2:2000-255] (from client edgegws port 3997951)
Tue Aug 20 17:42:36 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:42:36 2019 : Auth: (3) Login OK: [edgegw01-lab-scl.ps2:2000-255] (from client edgegws port 0)
Tue Aug 20 17:42:46 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:42:46 2019 : Auth: (4) Login OK: [edgegw01-lab-scl.ps2:2000-255] (from client edgegws port 0)
Tue Aug 20 17:44:31 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:44:32 2019 : Info: *** radiusauth.py: Circuit found: 753 ***

After a while 'Invalid user' starts popping up:

Tue Aug 20 17:50:24 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:50:24 2019 : Error: *** radiusauth.py: Circuit not found: edgegw01-lab-scl.ps2:2000-331 ***
Tue Aug 20 17:50:24 2019 : Auth: (0) Login OK: [edgegw01-lab-scl.ps2:2000-331] (from client edgegws port 0)
Tue Aug 20 17:50:26 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:50:27 2019 : Error: *** radiusauth.py: Circuit not found: edgegw01-lab-scl.ps2:2000-332 ***
Tue Aug 20 17:50:27 2019 : Auth: (1) Login OK: [edgegw01-lab-scl.ps2:2000-332] (from client edgegws port 0)
Tue Aug 20 17:50:29 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:50:31 2019 : Auth: (2) Invalid user: [edgegw01-lab-scl.ps2:2000-333] (from client edgegws port 0)
Tue Aug 20 17:50:34 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 17:50:35 2019 : Error: *** radiusauth.py: Circuit not found: edgegw01-lab-scl.ps2:2000-334 ***
Tue Aug 20 17:50:35 2019 : Auth: (3) Login OK: [edgegw01-lab-scl.ps2:2000-334] (from client edgegws port 0)
Tue Aug 20 17:50:37 2019 : Info: *** radiusauth.py: authorize starting ***

It only gets worse until the only thing that gets sent back is:

Tue Aug 20 18:44:21 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 18:44:21 2019 : Auth: (1094) Login OK: [edgegw01-lab-scl.ps2:2000-255] (from client edgegws port 3997951)
Tue Aug 20 18:44:22 2019 : Auth: (1095) Invalid user: [edgegw01-lab-scl.ps2:2000-704] (from client edgegws port 0)
Tue Aug 20 18:44:25 2019 : Auth: (1096) Invalid user: [edgegw01-lab-scl.ps2:2000-705] (from client edgegws port 0)
Tue Aug 20 18:44:28 2019 : Auth: (1097) Invalid user: [edgegw01-lab-scl.ps2:2000-706] (from client edgegws port 0)
Tue Aug 20 18:44:31 2019 : Auth: (1098) Invalid user: [edgegw01-lab-scl.ps2:2000-707] (from client edgegws port 0)
Tue Aug 20 18:44:34 2019 : Info: *** radiusauth.py: authorize starting ***
Tue Aug 20 18:44:36 2019 : Auth: (1099) Invalid user: [edgegw01-lab-scl.ps2:2000-708] (from client edgegws port 0)
Tue Aug 20 18:44:39 2019 : Auth: (1100) Invalid user: [edgegw01-lab-scl.ps2:2000-709] (from client edgegws port 0)
Tue Aug 20 18:44:42 2019 : Auth: (1101) Invalid user: [edgegw01-lab-scl.ps2:2000-710] (from client edgegws port 0)
Tue Aug 20 18:44:45 2019 : Auth: (1102) Invalid user: [edgegw01-lab-scl.ps2:2000-711] (from client edgegws port 0)
Tue Aug 20 18:44:48 2019 : Auth: (1103) Invalid user: [edgegw01-lab-scl.ps2:2000-712] (from client edgegws port 0)
Tue Aug 20 18:44:51 2019 : Auth: (1104) Invalid user: [edgegw01-lab-scl.ps2:2000-713] (from client edgegws port 0)

Graph of memory usage:

leak

@funzoneq

This comment has been minimized.

Copy link
Author

commented Aug 26, 2019

Waking up in 0.3 seconds.
(732) Received Access-Request Id 239 from 127.0.0.1:34852 to 127.0.0.1:1812 length 85
(732)   User-Name = "edgegw01-lab-scl.ps2:2003-894"
(732)   User-Password = "xxxxxxxxxxx"
(732) # Executing section authorize from file /etc/raddb/sites-enabled/default
(732)   authorize {
(732)     policy filter_username {
(732)       if (&User-Name) {
(732)       if (&User-Name)  -> TRUE
(732)       if (&User-Name)  {
(732)         if (&User-Name =~ / /) {
(732)         if (&User-Name =~ / /)  -> FALSE
(732)         if (&User-Name =~ /@[^@]*@/ ) {
(732)         if (&User-Name =~ /@[^@]*@/ )  -> FALSE
(732)         if (&User-Name =~ /\.\./ ) {
(732)         if (&User-Name =~ /\.\./ )  -> FALSE
(732)         if ((&User-Name =~ /@/) && (&User-Name !~ /@(.+)\.(.+)$/))  {
(732)         if ((&User-Name =~ /@/) && (&User-Name !~ /@(.+)\.(.+)$/))   -> FALSE
(732)         if (&User-Name =~ /\.$/)  {
(732)         if (&User-Name =~ /\.$/)   -> FALSE
(732)         if (&User-Name =~ /@\./)  {
(732)         if (&User-Name =~ /@\./)   -> FALSE
(732)       } # if (&User-Name)  = notfound
(732)     } # policy filter_username = notfound
(732)     [preprocess] = ok
(732)     [python] = fail
(732)   } # authorize = fail
(732) Invalid user: [edgegw01-lab-scl.ps2:2003-894] (from client edgegws port 0)
(732) Using Post-Auth-Type Reject
(732) # Executing group from file /etc/raddb/sites-enabled/default
(732)   Post-Auth-Type REJECT {
(732) attr_filter.access_reject: EXPAND %{User-Name}
(732) attr_filter.access_reject:    --> edgegw01-lab-scl.ps2:2003-894
(732) attr_filter.access_reject: Matched entry DEFAULT at line 11
(732)     [attr_filter.access_reject] = updated
(732)     [eap] = noop
(732)     policy remove_reply_message_if_eap {
(732)       if (&reply:EAP-Message && &reply:Reply-Message) {
(732)       if (&reply:EAP-Message && &reply:Reply-Message)  -> FALSE
(732)       else {
(732)         [noop] = noop
(732)       } # else = noop
(732)     } # policy remove_reply_message_if_eap = noop
(732)   } # Post-Auth-Type REJECT = updated
(732) Delaying response for 1.000000 seconds
Waking up in 0.3 seconds.
Waking up in 0.6 seconds.
(727) Cleaning up request packet ID 164 with timestamp +734
(728) Cleaning up request packet ID 10 with timestamp +734
(731) Sending delayed response
(731) Sent Access-Reject Id 214 from 127.0.0.1:1812 to 127.0.0.1:53665 length 20
(732) Sending delayed response
(732) Sent Access-Reject Id 239 from 127.0.0.1:1812 to 127.0.0.1:34852 length 20
@alandekok

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

3.0.13 is over two years old. We are currently at 3.0.19. Please recheck with the latest release.

There is no point for us to track down and fix bugs which were already tracked down and fixed two years ago.

If it cannot be reproduced in 3.0.19, then please close this bug. If it can be reproduced with 3.0,.19, then please confirm. Otherwise we will close this issue in 30 days, as being only relevant to an old version of the server.

Complete packages for 3.0.19 are available at http://packages.networkradius.com

For whatever reasons, most OS vendors ship releases that are years out of date.

@funzoneq

This comment has been minimized.

Copy link
Author

commented Aug 26, 2019

Hi @alandekok,

I re-checked the same configuration with the latest RPMs. Same result:

(349) Received Access-Request Id 36 from 192.168.200.34:35335 to 192.168.200.34:1812 length 85
(349)   User-Name = "edgegw01-lab-scl.ps2:2015-956"
(349)   User-Password = "xxxxxxxxx"
(349) # Executing section authorize from file /etc/raddb/sites-enabled/default
(349)   authorize {
(349)     policy filter_username {
(349)       if (&User-Name) {
(349)       if (&User-Name)  -> TRUE
(349)       if (&User-Name)  {
(349)         if (&User-Name =~ / /) {
(349)         if (&User-Name =~ / /)  -> FALSE
(349)         if (&User-Name =~ /@[^@]*@/ ) {
(349)         if (&User-Name =~ /@[^@]*@/ )  -> FALSE
(349)         if (&User-Name =~ /\.\./ ) {
(349)         if (&User-Name =~ /\.\./ )  -> FALSE
(349)         if ((&User-Name =~ /@/) && (&User-Name !~ /@(.+)\.(.+)$/))  {
(349)         if ((&User-Name =~ /@/) && (&User-Name !~ /@(.+)\.(.+)$/))   -> FALSE
(349)         if (&User-Name =~ /\.$/)  {
(349)         if (&User-Name =~ /\.$/)   -> FALSE
(349)         if (&User-Name =~ /@\./)  {
(349)         if (&User-Name =~ /@\./)   -> FALSE
(349)       } # if (&User-Name)  = notfound
(349)     } # policy filter_username = notfound
(349)     [preprocess] = ok
(349)     [python] = fail
(349)   } # authorize = fail
(349) Invalid user: [edgegw01-lab-scl.ps2:2015-956] (from client edgegws port 0)
(349) Using Post-Auth-Type Reject
(349) # Executing group from file /etc/raddb/sites-enabled/default
(349)   Post-Auth-Type REJECT {
(349) attr_filter.access_reject: EXPAND %{User-Name}
(349) attr_filter.access_reject:    --> edgegw01-lab-scl.ps2:2015-956
(349) attr_filter.access_reject: Matched entry DEFAULT at line 11
(349)     [attr_filter.access_reject] = updated
(349)     [eap] = noop
(349)     policy remove_reply_message_if_eap {
(349)       if (&reply:EAP-Message && &reply:Reply-Message) {
(349)       if (&reply:EAP-Message && &reply:Reply-Message)  -> FALSE
(349)       else {
(349)         [noop] = noop
(349)       } # else = noop
(349)     } # policy remove_reply_message_if_eap = noop
(349)   } # Post-Auth-Type REJECT = updated
(349) Login incorrect: [edgegw01-lab-scl.ps2:2015-956] (from client edgegws port 0)
(349) Delaying response for 1.000000 seconds
[root@radius01-lab-emy ~]# radiusd -v
radiusd: FreeRADIUS Version 3.0.19, for host x86_64-redhat-linux-gnu, built on Aug  2 2019 at 19:38:39
FreeRADIUS Version 3.0.19
Copyright (C) 1999-2019 The FreeRADIUS server project and contributors
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE
You may redistribute copies of FreeRADIUS under the terms of the
GNU General Public License
For more information about these matters, see the file named COPYRIGHT
@alandekok

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

Thanks. We'll take a look and see if we can find anything.

@arr2036

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

Valgrind or asan output would be helpful

@funzoneq

This comment has been minimized.

Copy link
Author

commented Aug 26, 2019

Is this enough? Let me know which command line options you would like.

valgrind-full.txt
valgrind.txt

@alandekok

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

==23967== 140,712 bytes in 1,378 blocks are possibly lost in loss record 5,549 of 5,566
==23967==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==23967==    by 0xEC8EEAC: PyObject_Malloc (in /usr/lib64/libpython2.7.so.1.0)
...
==23967== 454,016 bytes in 301 blocks are possibly lost in loss record 5,564 of 5,566
==23967==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==23967==    by 0xEC8EEAC: PyObject_Malloc (in /usr/lib64/libpython2.7.so.1.0)
...
==23967== 786,432 bytes in 1 blocks are possibly lost in loss record 5,565 of 5,566
==23967==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==23967==    by 0xEC86888: ??? (in /usr/lib64/libpython2.7.so.1.0)
==23967==    by 0xEC97CD0: PyString_InternInPlace (in /usr/lib64/libpython2.7.so.1.0)
...

The issues are in Python, not in FreeRADIUS. Your script likely needs to force garbage collection periodically:

https://docs.python.org/2/library/gc.html

From the side of FreeRADIUS, we just call the Python API to interpret the script. There may be a way to force it from the rlm_python.c file. But it's easier to do it from the script. And, Python2 is going to be EOL in a few months.

@arr2036

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

That's not necessarily the issue, python's C API is pretty horrific, we could be missing a reference decrement somewhere non-obvious, we'd need to see the complete stack trace to be sure. @funzoneq could you run valgrind again with --num-callers=30 ensuring you have the debug symbols in place for python and freeradius.

@funzoneq

This comment has been minimized.

Copy link
Author

commented Aug 26, 2019

@arr2036 : valgrind.txt

@alandekok that was my first thought too, but I ran my python script with memory analyzer: https://pythonhosted.org/Pympler/muppy.html and it showed no memory leaks.

@arr2036

This comment has been minimized.

Copy link
Member

commented Aug 26, 2019

@funzoneq - Unfortunately still not enough of the stack trace to be useful. Can you keep increasing until you hit the thread entry point frame.

@funzoneq

This comment has been minimized.

Copy link
Author

commented Aug 27, 2019

@arr2036 I see main now. valgrind.txt.zip

@alandekok

This comment has been minimized.

Copy link
Member

commented Aug 27, 2019

All of the leaks show that execution begins with mod_instantiate() in rlm_python. That function just loads the Python files. It doesn't execute them.

Are you starting up multiple threads in Python from the instantiation function?

What happens when you run a python script that does nothing, and just returns "ok"?

The valgrind stack trace doesn't show any leaks from run-time code in rlm_python. i.e. there is zero reference to mod_authorize in the valgrind output. Which means that the leaks aren't happening during that function call.

@arr2036

This comment has been minimized.

Copy link
Member

commented Aug 27, 2019

I agree with @alandekok, looks like for some reason the destructor for the module isn't being called, and that's what's generating all the noise. There's nothing there which shows runtime leaks, so there's nothing that we can really investigate on our side.

The only other thing I would suggest is to use a build from the master branch and see if the issue is still present there. The thread instantiation and destruction is better there, so if nothing else the valgrind noise should be greatly reduced.

@arr2036

This comment has been minimized.

Copy link
Member

commented Sep 18, 2019

Closing this issue until @funzoneq provides test data from master branch.

@arr2036 arr2036 closed this Sep 18, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.