Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] infinitely loop at CRcvLossList::remove() #2192

Closed
gou4shi1 opened this issue Dec 1, 2021 · 5 comments · Fixed by #2195
Closed

[BUG] infinitely loop at CRcvLossList::remove() #2192

gou4shi1 opened this issue Dec 1, 2021 · 5 comments · Fixed by #2195
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@gou4shi1
Copy link
Contributor

gou4shi1 commented Dec 1, 2021

Describe the bug
I'm using group mode to receive tsbpd video, and suddenly all videos stuck, then I found that one tsbpd thread is using 100% CPU, so I use gdb to attach into that thread:

(gdb) where
#0  0x000056162d561748 in CRcvLossList::remove (this=this@entry=0x7f5ca39713b0, seqno=seqno@entry=155602316)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/list.cpp:626
#1  0x000056162d5619aa in CRcvLossList::remove (this=0x7f5ca39713b0, seqno1=<optimized out>, seqno2=<optimized out>)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/list.cpp:697
#2  0x000056162d5369ed in srt::CUDT::dropFromLossLists (this=this@entry=0x7f5ca3d19118, from=from@entry=155602313, to=to@entry=155602321)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/core.cpp:10318
#3  0x000056162d536237 in srt::CUDT::updateForgotten (this=0x7f5ca3d19118, seqlen=9, lastack=155602313, skiptoseqno=<optimized out>)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/core.cpp:5383
#4  srt::CUDT::tsbpd (param=0x7f5ca3d19118) at /worker/build/68c89ff87558e392/root/external/srt/srtcore/core.cpp:5225
(gdb) p *this
$2 = {m_caSeq = 0x7f5c85094cc0, m_iHead = -1, m_iTail = 13690, m_iLength = 1234, m_iSize = 25600}

There are several strange phenomenons:

  1. m_iHead is -1
  2. From 10:14:08 to 11:43:39, it keep printing SRT.xt: sendCtrl(UMSG_ACK): IPE: curr %196624 <% last %155161598, and the videos stuck at around 11:43. I don't why getFirstLostSeq() would return 196624 since all the packets of that socket at that morning are roughly from 155018194 to 155602313.
  3. m_iRcvLastAck=155161598 while m_iRcvLastSkipAck=155602313 and m_iRcvCurrSeqNo=155602322

To Reproduce
SRT broadcast videos to the server in live mode and the network is bad.

Expected behavior
LossList should work as expected.

Desktop (please provide the following information):

  • OS: Linux
  • SRT Version / commit ID: 1.44 / 8b32f37
@gou4shi1 gou4shi1 added the Type: Bug Indicates an unexpected problem or unintended behavior label Dec 1, 2021
@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Dec 1, 2021

It happened again in the next day, at this time, thread SRT:RcvQ:w1 used 100% CPU:

(gdb) where
#0  0x000055c0b1509748 in CRcvLossList::remove (this=this@entry=0x7f40b10f75e0, seqno=seqno@entry=662003842)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/list.cpp:626
#1  0x000055c0b15099aa in CRcvLossList::remove (this=0x7f40b10f75e0, seqno1=<optimized out>, seqno2=<optimized out>)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/list.cpp:697
#2  0x000055c0b14de9ed in srt::CUDT::dropFromLossLists (this=this@entry=0x7f40dcd13d08, from=662003842, to=662003845)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/core.cpp:10318
#3  0x000055c0b14e7367 in srt::CUDT::processCtrlDropReq (this=this@entry=0x7f40dcd13d08, ctrlpkt=...)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/core.cpp:8672
#4  0x000055c0b14e7809 in srt::CUDT::processCtrl (this=0x7f40dcd13d08, ctrlpkt=...)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/core.cpp:8783
#5  0x000055c0b14fe06a in srt::CRcvQueue::worker_ProcessAddressedPacket (this=0x55c0b242df70, id=<optimized out>, unit=0x3483, addr=...)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/queue.cpp:1483
#6  0x000055c0b14fd101 in srt::CRcvQueue::worker (param=0x55c0b242df70)
    at /worker/build/68c89ff87558e392/root/external/srt/srtcore/queue.cpp:1262
#7  0x00007f426343ea00 in ?? () from target:/home/jingchi/remote_server/lib/libstdc++.so.6
#8  0x00007f426495e6ba in start_thread (arg=0x7f41d17fa700) at pthread_create.c:333
#9  0x00007f4262c8a4dd in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) p *this
$1 = {m_caSeq = 0x7f40dec51270, m_iHead = 0, m_iTail = 3, m_iLength = 1, m_iSize = 25600}

I also printed the value of m_caSeq[0...3] in gdb, all are {seqstart=-1, seqend=-1}.
At this time, no IPE is printed, but I believe that the result of getFirstLostSeq() is still invalid since m_iRcvLastAck=661953185 while m_iRcvLastSkipAck=662004389 and m_iRcvCurrSeqNo=662004394

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Dec 1, 2021

I believe that it was stuck at

    // searching previous node
    int i = (loc - 1 + m_iSize) % m_iSize;
    while (SRT_SEQNO_NONE == m_caSeq[i].seqstart)
        i = (i - 1 + m_iSize) % m_iSize;

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Dec 1, 2021

I added some debug logs, and it happened the third time, now I found out why getFirstLostSeq() would keep return wrong result:

14:44:06.810078/SRT:RcvQ:w1 D:SRT.xt: @207803157: sendCtrlAck(): m_pRcvLossList->getFirstLostSeq()=-1
14:44:06.810121/SRT:TsbPd D:SRT.qr: @207803157: dropFromLossLists(): m_pRcvLossList->remove(899264403, 899264403)=1
14:44:06.810152/SRT:TsbPd!W:SRT.br: @207803157:RCV-DROPPED 1 packet(s), packet seqno %899264404 delayed for 1992.087 ms
14:44:06.810157/SRT:RcvQ:w1 D:SRT.qr: @207803157: processData(): m_pRcvLossList->insert(899264403, 899264403)
14:44:06.814346/SRT:RcvQ:w1 D:SRT.qr: @207803157: processData(): m_pRcvLossList->insert(899264408, 899264408)
14:44:06.814379/SRT:RcvQ:w1 D:SRT.qr: @207803157: unlose(): m_pRcvLossList->remove(899264408)=1
14:44:06.814408/SRT:TsbPd D:SRT.qr: @207803157: dropFromLossLists(): m_pRcvLossList->remove(899264404, 899264410)=1
14:44:06.814438/SRT:TsbPd!W:SRT.br: @207803157:RCV-DROPPED 7 packet(s), packet seqno %899264411 delayed for 1996.374 ms
14:44:06.814743/SRT:RcvQ:w1 D:SRT.qr: @207803157: dropFromLossLists(): m_pRcvLossList->remove(899263551, 899263742)=1
14:44:06.820454/SRT:RcvQ:w1 D:SRT.xt: @207803157: sendCtrlAck(): m_pRcvLossList->getFirstLostSeq()=899264403
14:44:06.830957/SRT:RcvQ:w1 D:SRT.xt: @207803157: sendCtrlAck(): m_pRcvLossList->getFirstLostSeq()=899264403
14:44:06.836730/SRT:RcvQ:w1 D:SRT.qr: @207803157: dropFromLossLists(): m_pRcvLossList->remove(899263755, 899263759)=1
14:44:06.836748/SRT:TsbPd D:SRT.qr: @207803157: dropFromLossLists(): m_pRcvLossList->remove(899264411, 899264411)=1
14:44:06.836771/SRT:TsbPd!W:SRT.br: @207803157:RCV-DROPPED 1 packet(s), packet seqno %899264412 delayed for 2018.706 ms
14:44:06.842041/SRT:RcvQ:w1 D:SRT.xt: @207803157: sendCtrlAck(): m_pRcvLossList->getFirstLostSeq()=899264403
14:44:06.847927/SRT:RcvQ:w1 D:SRT.qr: @207803157: dropFromLossLists(): m_pRcvLossList->remove(899263778, 899263779)=1
14:44:06.852335/SRT:RcvQ:w1 D:SRT.xt: @207803157: sendCtrlAck(): m_pRcvLossList->getFirstLostSeq()=899264403m_iHead=0, m_iTail=0, loc=0

When processData() received 899264404, it was added into the recv buffer, after recvbuf_acklock was unlocked, the tsbpd thread played 899264404 and dropped 899264403, then processData() immediately insert 899264403 back into loss list, after this, getFirstLostSeq() keep return 899264403

@maxsharabayko maxsharabayko added the [core] Area: Changes in SRT library core label Dec 1, 2021
@maxsharabayko maxsharabayko added this to the v1.4.5 milestone Dec 1, 2021
@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Dec 1, 2021

From the above debug logs, I also found out why CRcvLossList::m_iLength would wrong. Since 899264403 was left in the loss list, when CRcvLossList::insert(899290003) was called, seqoff(899264403, 899290003) is exactly m_iSize, so m_caSeq[loc] was updated from 899264403 to 899290003 and m_iLength was increased. As a result, m_iLength was inconsistent with m_iHead and m_iTail.

@maxsharabayko
Copy link
Collaborator

maxsharabayko commented Dec 1, 2021

Sounds like either m_RcvBufferLock has to be locked until RcvLossList is updated (which would be bad for performance), or the RcvLossList should keep track of the first unacknowledged position (which is already tracked by CUDT::m_iRcvLastSkipAck, but protected with m_RcvBufferLock). 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants