Out of memory: Kill process 3386 (kurento-media-s) score 765 or sacrifice child #313
I am using an AWS t2.medium instance. Analysing further, I can see that with 14 users Kurento starts at 35-40% CPU utilization, but it suddenly jumps to 200% for a moment, and that's why the kernel is killing the Kurento process.
The issue appears when I rejoin a user with different video stream resolutions (QCIF, SD, HD); I can see the Kurento process get stuck (KMS 6.7.2, [Thread debugging using libthread_db enabled]).
@j1elo This is one more issue I am able to reproduce. I will enable debug symbols/logs and share the exact steps and backtrace with you.
Note that we have received reports from other users of KMS 6.8.1 and the DEV branch, testing 1-to-400 WebRTC broadcasting and also 30-to-30 video conferencing, and it all worked fine. We are investigating a possible memory leak, but it shouldn't break so easily with only 12 streams.
@j1elo Actually, in my case the CPU utilization goes high for a moment and the kernel kills it. The oom-killer kills the process with the highest score first; the score is driven mostly by memory usage, so it tends to target the process holding the most memory.
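To see the scores the kernel uses, Linux exposes them under /proc. The sketch below (assuming a running `kurento-media-server` process on a Linux host) reads the live score that the OOM killer would compare:

```shell
# The OOM killer picks the process with the highest oom_score, which is
# driven mostly by memory footprint (CPU load does not factor in directly).
PID="$(pgrep -f kurento-media-server | head -n1)"
cat "/proc/$PID/oom_score"      # higher score = first candidate to be killed
cat "/proc/$PID/oom_score_adj"  # user-set bias; -1000 exempts the process
```

Watching `oom_score` grow over a session is a quick way to confirm a leak in the media server itself, rather than a co-located process.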
Can you tell me how much CPU utilization was seen for the 400-WebRTC and the 30-to-30 video conference tests?
Also keep in mind that KMS has trouble processing 1080p streams. We had a user with a product for 4K videos; they tried to send 4K through KMS, but after some experimentation they found the maximum possible is 720p. And this was only 1 stream... if it's 12 streams, I would say it's normal to see the CPU getting exhausted. Another user tried 1080p RTSP streams but found something similar: max. 720p and 2 MBps (see notes here).
@j1elo In our case it was 720p; we were selecting HD but the browser was actually sending a 720p stream. I verified it with chrome://webrtc-internals.
Ok, then start with only 1 participant, monitor CPU, and slowly add participants 1 by 1, so you can find out how many participants are able to work at the same time. I'll ask the Kurento team about these limits.
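One way to do that monitoring is to sample the media server's CPU, memory, and thread count at a fixed interval while participants join. A minimal sketch, assuming a running `kurento-media-server` process:

```shell
# Sample CPU%, RAM% and thread count (nlwp) of the KMS process every
# 2 seconds while participants join one by one; stop with Ctrl+C.
PID="$(pgrep -f kurento-media-server | head -n1)"
while sleep 2; do
  ps -o %cpu=,%mem=,nlwp= -p "$PID"
done
```

Logging this to a file alongside a note of when each participant joined makes it easy to correlate the spike with a specific participant count.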
Hey @puneet89 - we are considering moving forward and implementing the latest version of KMS into our production environment, to hopefully get away from the segmentation fault issue detailed in #247. Have you any further insights following your testing? I would also appreciate it if you could detail the exact versions of KMS and libnice that you have been using. Thank you!
Issue #247 is definitely solved and it's working fine on the latest dev release. @ffxxrr This week I am planning to test the latest Kurento dev release for other scenarios (not specific to #247), as I am facing new issues like a GStreamer Critical error etc. on the latest release. I will share a detailed analysis of the issues on the latest release ASAP, as asked by @j1elo in #247 (comment).
Had a similar issue pop up yesterday. We had 8 users in SFU, all sharing video, on an M5-4XL instance. Kurento started pegging the CPU like I've never seen; normally I see around 30-40% (6% total) core utilization for 8 users all sharing video. About 10 minutes in, it started getting above 100% core usage. At 20 minutes in, the core % was dancing between 700% and 1000%. This is on 6.8.1, btw. I'll see if I can repro this today; so far I haven't had any luck. I'll also check whether any of our users are using a 1080p camera.
Please remember to share any findings you may have; if it helps to solve some performance issues with Kurento, we'll all benefit from it. Key things to check:
Of these, video transcoding is the main consumer of CPU cycles, because encoding video is a computationally expensive operation. As mentioned earlier, keep an eye on it. If you see that TRANSCODING is ACTIVE at some point, you may get a bit more information about why by adding this: Then looking for these messages:
Which will end up with either of these sets of messages:
This comment ended up being quite verbose; I think I'll use it to add a new section to the Troubleshooting Issues documentation, in case it helps somebody :)
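For illustration, debug levels like the ones discussed above are raised through GStreamer's standard `GST_DEBUG` environment variable, which takes comma-separated `category:level` pairs. The category names below are assumptions for the sake of the example; check the Kurento logging documentation for the exact ones that report transcoding activity:

```shell
# Raise per-category log verbosity before (re)starting kurento-media-server.
# "3" is the default level for everything else; higher numbers = more detail.
export GST_DEBUG="3,Kurento*:4,agnosticbin*:5"
# With KMS restarted under this environment, search its log for
# transcoding activity:
grep -i "TRANSCODING" /var/log/kurento-media-server/*.log
```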
@j1elo Sorry for the late reply. I have compiled with all debug symbols on one of the servers, but I am not able to reproduce the crash or the high CPU issue on that server. Attached are the production server Kurento logs, without debug symbols enabled.
@j1elo kurento-media-server version: 6.9.0.xenial~201812 amd64. Debug-level stack trace (log file also attached):
Segmentation fault (thread 139921209538304, pid 26999)
2018-12-07T06:56:49+00:00 -- New execution
@j1elo Again I am getting the following GStreamer Critical error. Can you please check the log file once?
(kurento-media-server:9545): GStreamer-CRITICAL **: Element rtpbin460 already has a pad named send_rtp_sink_0, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
(kurento-media-server:595): GStreamer-CRITICAL **: Element rtpbin6 already has a pad named send_rtp_sink_0, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
2018-12-11T08:08:33+00:00 -- New execution
(kurento-media-server:1031): GStreamer-CRITICAL **: Element rtpbin37 already has a pad named send_rtp_sink_0, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
2018-12-11T08:16:40+00:00 -- New execution
(kurento-media-server:2219): GStreamer-CRITICAL **: Element rtpbin914 already has a pad named send_rtp_sink_0, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
2018-12-11T14:32:24+00:00 -- New execution
(kurento-media-server:31700): GStreamer-CRITICAL **: Element rtpbin39 already has a pad named send_rtp_sink_0, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
2018-12-11T14:39:31+00:00 -- New execution
(kurento-media-server:32590): GStreamer-CRITICAL **: Element rtpbin32 already has a pad named send_rtp_sink_0, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
@j1elo Please let me know if you need more logs, as I am getting this crash on the latest Kurento only. kurento-media-server 6.9.0.xenial~2018120517 amd64 Kurento Media Server
Yes, stack traces without debug symbols are not very useful; please install all debug symbols before getting a stack trace, instructions here: https://doc-kurento.readthedocs.io/en/latest/user/troubleshooting.html#media-server-crashed |
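A sketch of that workflow, based on the linked troubleshooting guide (package and binary paths may differ between Kurento releases, so treat these names as assumptions):

```shell
# Install debug symbols for KMS and its modules.
sudo apt-get install kurento-dbg
# Allow core dumps in the shell that will launch KMS.
ulimit -c unlimited
# After a crash, extract a full backtrace from the core file with gdb:
gdb -batch -ex "thread apply all bt" \
    /usr/bin/kurento-media-server core > backtrace.txt
```

With symbols installed, the backtrace shows file names and line numbers instead of bare addresses, which is what makes it useful for a bug report.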
However, in your previous logs the debug symbols were installed, as I can see file names and line numbers, but they don't point to any interesting place in the source code. I think the crash is happening because of memory corruption or some other kind of issue that doesn't show up in the stack trace. This probably requires debugging with GDB, to set a breakpoint exactly when the crash happens. Please describe the situation when the crash occurs. Is the CPU at 100% load when KMS crashes?
@j1elo
./kurento-media-server/server/kurento-media-server --version
Crash backtrace:
Please let me know if you need any further inputs.
I can also see the following warning before the crash:
0:04:55.288650955 4912 0x7f64d40734f0 FIXME default gstutils.c:3766:gst_pad_create_stream_id_internal:<nicesrc1:src> Creating random stream-id, consider implementing a deterministic way of creating a stream-id
Hi, that looks like an abort message from GLib, the system library, complaining that it wasn't able to allocate some memory. I assume because all the memory in the machine was full... Did you see this message in the log?
All other warnings in your other message are unrelated to this issue. Did you monitor memory usage during this crash?
I observed that RAM was enough (around 800 MB free), but the CPU was spiking randomly, i.e. 200% (2 cores), even with only 5-6 users in H.264-only mode. (Note: VP8 commented out in the SdpEndpoint file.)
@j1elo I do see this error:
2019-02-02T16:23:48,204248 11159 0x00007f58ae7fc700 error glib GLib:0 () /build/glib2.0-7ZsPUq/glib2.0-2.48.2/./glib/gmem.c:100: failed to allocate 213942697 bytes
I also noticed a sudden spike in RAM consumption (70%) by Kurento, even with only 4-5 fake user videos; the system had 2.3 GB RAM free when Kurento was started.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
Please find the log file attached, and let me know if you need more details.
That's the most surprising part: it tries to allocate 200 MB! I'd need more information about these things:
Hi @j1elo
1. We have created the same pipeline as used by the Java groupcall example: https://doc-kurento.readthedocs.io/en/stable/tutorials/java/tutorial-groupcall.html
Let me know if you need any further info. Thanks & Regards |
Hi @j1elo, I am trying to run 50 users in a single conference on an AWS c5.large (4 GB RAM) instance, but when I reach around 28-34 users the CPU goes to 100%, or sometimes RAM utilization also fills up, and Kurento gets stuck at 100% CPU (of all cores). **On the same server, when I run 10 users each in 5 conferences, or 5 users each in 10 conferences, it works perfectly fine; if the 50 users are distributed across different conferences, no issue appears.** I am using the latest master branch. I noticed the following when the CPU got stuck; can this trace help find the root cause? Let me know if you need further details.
root 12594 1 99 10:42 pts/1 00:31:45 kurento-media-server/server/kurento-media-server --modules-path=. --modules-config-path=./config --conf-file=./config/kurento.conf.json --gst-plugin-path=. --logs-path, -d /home/ubuntu/VOICE/
root@ip-20-3-20-98:/home/ubuntu# cat /proc/12594/stack
Number of threads: 5386 for around 34-40 users in a single conference; each thread is taking ~1% CPU, and it seems every thread got stuck somewhere.
top - 13:33:09 up 46 min, 2 users, load average: 180.20, 158.27, 102.17
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
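A thread count like that can be tracked directly from /proc, since Linux exposes one `task` directory per thread. A minimal sketch, assuming a running `kurento-media-server` process:

```shell
# Count live threads in the KMS process; the Threads field of /proc status
# and the number of task subdirectories report the same figure.
PID="$(pgrep -f kurento-media-server | head -n1)"
grep '^Threads:' "/proc/$PID/status"
ls "/proc/$PID/task" | wc -l
```

Sampling this alongside participant count would show whether threads grow linearly with users or keep accumulating without being reaped.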
Hi, we have implemented horizontal scaling to spread a single conference across multiple Kurento media servers, as we were not able to run a big conference on a single Kurento instance due to the CPU spike issue. But even in this case we got a Kurento CPU spike, as described below. The conference was distributed across 5 c5.large AWS VMs. As soon as I started the 8th conference, Kurento CPU spiked to 100%; a gdb dump of all threads is below:
AWS Media Server Stats:
Media Server Stats per MeetingId:
gdb thread logs attached.
Thanks & Regards
Hi, thanks for taking the time to get the traces, but lines like
won't help much to know where the problem is (apart from knowing that it is somewhere in GStreamer). Please see: #313 (comment)
Okay @j1elo, I will try to share more details.
Hi @j1elo Regards |
@puneet89 We've lately been studying a similar crash in Kurento that happened because of a bug in the GStreamer H.264 parser code: #352. This bug had been fixed in a more modern GStreamer version than that of the Kurento fork. @prlanzarin contributed a cherry-pick backport of the relevant commit in Kurento/gst-plugins-good#2, so now we have that bug solved. The symptoms of that bug are very similar to the observations in your bug report, so maybe they are actually caused by the same bug. Please install Kurento nightly, or apt-get upgrade if you already have nightly installed, then test your use case and let us know if the issue is gone.
@j1elo Thanks ! |
This issue is not coming any more; just sometimes I see the following warning:
(kurento-media-server:26306): GStreamer-CRITICAL **: Element rtpbin33 already has a pad named send_rtp_sink_1, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
(kurento-media-server:26306): GStreamer-CRITICAL **: Element rtpbin34 already has a pad named send_rtp_sink_1, the behaviour of gst_element_get_request_pad() for existing pads is undefined!
No out of memory and no crash observed.
Hi, so just to confirm: this issue does not happen any more for you since you installed Kurento nightly, is that correct?
I have installed:
KMS 6.8.1 and the Dev branch
I am getting the following error on Kurento when more than 12 users join a conference, and after some time the binary is killed by the kernel.
(kurento-media-server:1251): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
KMS 6.7.2
Joining with 14 users works fine for 1 hour; after that the KMS binary stops and the following error logs are reported.
(kurento-media-server:3386): GStreamer-CRITICAL **: gst_mini_object_unlock: assertion 'state >= SHARE_ONE' failed
DMESG LOGS: It seems there is a memory leak in KMS, as the kernel is killing the Kurento process.
[15874.505398] Out of memory: Kill process 3386 (kurento-media-s) score 765 or sacrifice child
[15874.511950] Killed process 3386 (kurento-media-s) total-vm:10667176kB, anon-rss:3087576kB, file-rss:0kB
[15874.505143] java invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[15874.505146] java cpuset=/ mems_allowed=0
[15874.505151] CPU: 0 PID: 3436 Comm: java Not tainted 4.4.0-1072-aws #82-Ubuntu
[15874.505153] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[15874.505155] 0000000000000286 9cd6e1d582553a57 ffff88003687f9e8 ffffffff81401783
[15874.505158] ffff88003687fba0 ffff8800e9f31b80 ffff88003687fa58 ffffffff8121039a
[15874.505160] ffffffff81cc7209 0000000000000000 ffffffff81e6b2a0 0000000000000206
[15874.505163] Call Trace:
[15874.505170] [] dump_stack+0x63/0x90
[15874.505175] [] dump_header+0x5a/0x1c5
[15874.505178] [] oom_kill_process+0x202/0x3c0
[15874.505181] [] out_of_memory+0x219/0x460
[15874.505185] [] __alloc_pages_slowpath.constprop.88+0x943/0xaf0
[15874.505187] [] __alloc_pages_nodemask+0x288/0x2a0
[15874.505190] [] alloc_pages_current+0x8c/0x110
[15874.505193] [] __page_cache_alloc+0xab/0xc0
[15874.505195] [] filemap_fault+0x160/0x440
[15874.505199] [] ext4_filemap_fault+0x36/0x50
[15874.505202] [] __do_fault+0x77/0x110
[15874.505204] [] handle_mm_fault+0x1252/0x1b70
[15874.505208] [] __do_page_fault+0x1a4/0x410
[15874.505210] [] do_page_fault+0x22/0x30
[15874.505214] [] page_fault+0x28/0x30
[15874.505216] Mem-Info:
[15874.505220] active_anon:933608 inactive_anon:1067 isolated_anon:0
active_file:523 inactive_file:714 isolated_file:0
unevictable:913 dirty:0 writeback:0 unstable:0
slab_reclaimable:4455 slab_unreclaimable:10385
mapped:1448 shmem:1483 pagetables:4107 bounce:0
free:20927 free_pcp:138 free_cma:0
[15874.505224] Node 0 DMA free:15900kB min:264kB low:328kB high:396kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[15874.505230] lowmem_reserve[]: 0 3729 3919 3919 3919
[15874.505233] Node 0 DMA32 free:64792kB min:64064kB low:80080kB high:96096kB active_anon:3567896kB inactive_anon:4260kB active_file:2092kB inactive_file:2856kB unevictable:3080kB isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3835340kB mlocked:3080kB dirty:0kB writeback:0kB mapped:5220kB shmem:5896kB slab_reclaimable:17140kB slab_unreclaimable:38480kB kernel_stack:15312kB pagetables:15108kB unstable:0kB bounce:0kB free_pcp:552kB local_pcp:432kB free_cma:0kB writeback_tmp:0kB pages_scanned:32360 all_unreclaimable? yes
[15874.505239] lowmem_reserve[]: 0 0 189 189 189
[15874.505242] Node 0 Normal free:3016kB min:3248kB low:4060kB high:4872kB active_anon:166536kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:572kB isolated(anon):0kB isolated(file):0kB present:262144kB managed:193804kB mlocked:572kB dirty:0kB writeback:0kB mapped:572kB shmem:36kB slab_reclaimable:680kB slab_unreclaimable:3060kB kernel_stack:1584kB pagetables:1320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[15874.505247] lowmem_reserve[]: 0 0 0 0 0
[15874.505250] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15900kB
[15874.505261] Node 0 DMA32: 626*4kB (UE) 228*8kB (UME) 263*16kB (UME) 542*32kB (UE) 344*64kB (UME) 96*128kB (UME) 18*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 64792kB
[15874.505272] Node 0 Normal: 484*4kB (UMH) 39*8kB (M) 48*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3016kB
[15874.505281] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[15874.505282] 3382 total pagecache pages
[15874.505284] 0 pages in swap cache
[15874.505285] Swap cache stats: add 0, delete 0, find 0/0
[15874.505286] Free swap = 0kB
[15874.505287] Total swap = 0kB
[15874.505289] 1048477 pages RAM
[15874.505290] 0 pages HighMem/MovableOnly
[15874.505291] 37216 pages reserved
[15874.505291] 0 pages cma reserved
[15874.505292] 0 pages hwpoisoned
[15874.505293] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[15874.505298] [ 372] 0 372 8035 724 20 3 0 0 systemd-journal
[15874.505301] [ 427] 0 427 23693 167 17 3 0 0 lvmetad
[15874.505303] [ 457] 0 457 10669 657 23 3 0 -1000 systemd-udevd
[15874.505305] [ 706] 100 706 25081 431 20 3 0 0 systemd-timesyn
[15874.505307] [ 944] 0 944 4030 533 11 3 0 0 dhclient
[15874.505309] [ 1090] 0 1090 1305 29 9 3 0 0 iscsid
[15874.505312] [ 1091] 0 1091 1430 877 9 3 0 -17 iscsid
[15874.505314] [ 1095] 107 1095 10755 562 26 3 0 -900 dbus-daemon
[15874.505316] [ 1097] 0 1097 7136 497 18 3 0 0 systemd-logind
[15874.505318] [ 1103] 0 1103 68654 794 37 3 0 0 accounts-daemon
[15874.505320] [ 1111] 0 1111 75444 1573 31 6 0 0 amazon-ssm-agen
[15874.505322] [ 1118] 104 1118 65157 511 28 3 0 0 rsyslogd
[15874.505324] [ 1120] 0 1120 6511 416 18 3 0 0 atd
[15874.505326] [ 1125] 0 1125 77160 242 20 3 0 0 lxcfs
[15874.505329] [ 1132] 0 1132 6932 383 18 3 0 0 cron
[15874.505331] [ 1136] 0 1136 78097 3009 39 5 0 -900 snapd
[15874.505333] [ 1145] 0 1145 1099 306 8 3 0 0 acpid
[15874.505335] [ 1164] 0 1164 16377 569 35 3 0 -1000 sshd
[15874.505337] [ 1178] 0 1178 3343 36 11 3 0 0 mdadm
[15874.505339] [ 1179] 0 1179 69278 163 39 3 0 0 polkitd
[15874.505341] [ 1242] 0 1242 4868 251 15 3 0 0 irqbalance
[15874.505343] [ 1317] 0 1317 3664 351 12 3 0 0 agetty
[15874.505345] [ 1321] 0 1321 3618 419 12 3 0 0 agetty
[15874.505348] [ 1657] 1000 1657 11287 495 27 3 0 0 systemd
[15874.505350] [ 1661] 1000 1661 15389 566 33 3 0 0 (sd-pam)
[15874.505352] [ 2612] 0 2612 23200 678 51 3 0 0 sshd
[15874.505354] [ 2667] 1000 2667 23391 441 48 3 0 0 sshd
[15874.505356] [ 2668] 0 2668 23199 673 48 3 0 0 sshd
[15874.505358] [ 2670] 1000 2670 5428 944 15 3 0 0 bash
[15874.505360] [ 2696] 0 2696 13938 539 32 3 0 0 sudo
[15874.505362] [ 2697] 0 2697 12751 509 30 4 0 0 su
[15874.505364] [ 2698] 0 2698 5077 612 14 3 0 0 bash
[15874.505366] [ 2738] 1000 2738 23199 292 47 3 0 0 sshd
[15874.505368] [ 2739] 1000 2739 3220 442 12 3 0 0 sftp-server
[15874.505371] [19655] 0 19655 23200 648 46 3 0 0 sshd
[15874.505373] [19710] 1000 19710 23268 576 47 3 0 0 sshd
[15874.505375] [19711] 0 19711 23199 668 50 3 0 0 sshd
[15874.505377] [19713] 1000 19713 5428 961 15 3 0 0 bash
[15874.505379] [19770] 1000 19770 23199 375 48 3 0 0 sshd
[15874.505381] [19771] 1000 19771 3220 441 12 3 0 0 sftp-server
[15874.505383] [19772] 0 19772 13938 554 32 3 0 0 sudo
[15874.505385] [19773] 0 19773 12751 492 30 4 0 0 su
[15874.505387] [19774] 0 19774 5032 596 14 3 0 0 bash
[15874.505390] [ 3386] 1001 3386 2666794 771894 2535 15 0 0 kurento-media-s
[15874.505392] [ 3402] 0 3402 899797 141796 389 6 0 0 java
[15874.505394] [ 3417] 0 3417 1500 320 8 3 0 0 tailf
[15874.505396] [12376] 0 12376 10130 422 24 3 0 0 top
[15874.505398] Out of memory: Kill process 3386 (kurento-media-s) score 765 or sacrifice child
[15874.511950] Killed process 3386 (kurento-media-s) total-vm:10667176kB, anon-rss:3087576kB, file-rss:0kB
root@ip-20-3-20-240:/var/log/kurento-media-server#