Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extra 2xx_transaction on reply retransmission #1636

Closed
bcnewlin opened this issue Mar 13, 2019 · 16 comments
Closed

Extra 2xx_transaction on reply retransmission #1636

bcnewlin opened this issue Mar 13, 2019 · 16 comments
Assignees
Milestone

Comments

@bcnewlin
Copy link

bcnewlin commented Mar 13, 2019

I have found a scenario where I believe there is an error in statistics collection for the TM module. I am attaching a trace where OpenSIPS is the node at 203.0.113.3.

In this scenario the ACK to the 200 OK response is delayed and the 200 response is retransmitted. After this call, the TM module showed the following statistics:

4xx_transactions: 0,
UAS_transactions: 2,
2xx_transactions: 3,
relayed_replies: 4,
5xx_transactions: 0,
received_replies: 7,
3xx_transactions: 0,
UAC_transactions: 0,
local_replies: 1,
6xx_transactions: 0,
inuse_transactions: 2

There should be only two 2xx_transactions: the INVITE and the BYE. I believe the retransmitted 200 OK is being erroneously counted as an additional transaction.

opensips -V

version: opensips 2.4.4 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 52ff74a
main.c compiled on 20:42:40 Mar 13 2019 with gcc 7

extra_transaction.pcap.zip

@bogdan-iancu
Copy link
Member

@bcnewlin , I can confirm (by checking the code) that the statistics are actually counting the number of relayed replies (with certain code classes).
Instead of doing the counting upon reply relaying, a better approach may be to update the stats (based on the last reply sent to caller side) when the transaction is destroyed. This will guarantee only one counting per transaction.

@stale
Copy link

stale bot commented Jun 28, 2019

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@stale stale bot added the stale label Jun 28, 2019
@bcnewlin
Copy link
Author

bcnewlin commented Jul 9, 2019

Updating to prevent stale bot from closing.

@stale stale bot removed the stale label Jul 9, 2019
@bogdan-iancu
Copy link
Member

@bcnewlin , could you comment on my above reply ?

@bcnewlin
Copy link
Author

@bogdan-iancu Apologies, I didn't realize you were looking for a response. I think your suggested approach is a good one.

@bogdan-iancu
Copy link
Member

OK, thanks for confirmation, I will work out a fix in this direction.

@stale
Copy link

stale bot commented Jul 26, 2019

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@stale stale bot added the stale label Jul 26, 2019
@bogdan-iancu
Copy link
Member

@bcnewlin could you give a spin to this patch? to be honest with you I haven;t tested at all, so I rely on you for some full testing of the stats :)

@stale stale bot removed the stale label Jul 26, 2019
@bcnewlin
Copy link
Author

@bogdan-iancu This patch seems to be causing transaction counts to be not counted in some cases. On our server that uses only TM it appears to be fine, but on our server that uses Dialog module the INVITE transaction is not counted on reply.

Here are the stats after a single successful call, which is INVITE and BYE both with 200OK:
4xx_transactions: 0,
UAS_transactions: 2,
2xx_transactions: 1,
relayed_replies: 3,
5xx_transactions: 0,
received_replies: 4,
3xx_transactions: 0,
UAC_transactions: 0,
local_replies: 1,
6xx_transactions: 0,
inuse_transactions: 1

You can see one 200OK is not counted and inuse_transactions is left at 1. I believe it is the INVITE that is getting missed as when I run a call with only an INVITE and 486 response I see these stats:
4xx_transactions: 0,
UAS_transactions: 1,
2xx_transactions: 0,
relayed_replies: 2,
5xx_transactions: 0,
received_replies: 3,
3xx_transactions: 0,
UAC_transactions: 0,
local_replies: 1,
6xx_transactions: 0,
inuse_transactions: 1

@bogdan-iancu
Copy link
Member

@bcnewlin , maybe to pull the stats too fast after the call end. The transactions are deleted (by default) after like 5 secs after final reply. This may explain why the BYE transaction is still in use, and not yet counted as 2xx. Just be sure you wait 10 secs or so. Right now the Nxx_transactions stats are updated when the transactions are destroyed , so after a bit of time after their completion.

@bcnewlin
Copy link
Author

@bogdan-iancu You are right, inserting a delay before pulling the statistics resolved the issue. This is not too bad a side effect in practical use, but will add significant time to our testbed unfortunately. I have decreased the wt_timer TM parameter in our tests to compensate, but that may potentially increase our exposure to other scheduling and timing issues. However, that is our problem.

I will continue testing the patch and let you know the results.

@bogdan-iancu
Copy link
Member

@bcnewlin , I agree that the delay is a bit undesired. So, please check this new patch, where the stats are updated when inserting in the wait timer, and not when removing. It should get rid of that delay.

@bcnewlin
Copy link
Author

bcnewlin commented Jul 28, 2019

@bogdan-iancu I have verified the original patch fixes the scenario described in this ticket and reports the correct stats for all our other tests.

For the new patch, I have verified that it fixes the scenario described in this ticket and does not require a delay before collecting stats. However, I am now seeing an extra 4xx_transaction count for scenarios involving authentication on our server using Dialog. Even if no second Invite is sent with credentials (because no matching credentials are found), I still see two 4xx_transactions (and two received_replies) but only one UAS_transaction:

4xx_transactions: 2,
UAS_transactions: 1,
2xx_transactions: 0,
relayed_replies: 1,
5xx_transactions: 0,
received_replies: 2,
3xx_transactions: 0,
UAC_transactions: 0,
local_replies: 1,
6xx_transactions: 0,
inuse_transactions: 0

I'm attaching a trace for this scenario, in which the OpenSIPS server using Dialog is at 203.0.113.3.
auth_stats_trace.pcap.zip

I've verified this is not the case for generic 4xx responses, it seems to only occur for 401, though possibly also for 407 as we have no test for that since SIPp does not support it.

@bogdan-iancu
Copy link
Member

@bcnewlin , the thing you observed may be related to this part of the code - when I sent you the second patch I had in my mind the issue with "double" put_on_wait() , but I considered it as result of some fubar signaling, not something derived from day by day traffic :D
I'm still 100% convinced that the right moment to account the transactions (per code) is into a place that happens only once in the transaction lifetime (like destroying, but that's too late, not quite realtime).
So, here is a slightly modified version of the second patch, which takes care of multiple put_on_wait() - see tm_trans_stats_fix.patch3.txt. Hopefully it will address all the know issues :)

@bcnewlin
Copy link
Author

bcnewlin commented Aug 1, 2019

@bogdan-iancu I think you are right about that code being the culprit.

I've run all of our tests against this patch and verified it produces the expected statistics in all cases with no delay. I've also verified it produces the correct statistics in the reported case of a retransmission.

@bogdan-iancu
Copy link
Member

Perfect, thank you @bcnewlin for the testing, I will push the fix and backport it also.

@bogdan-iancu bogdan-iancu added this to the 2.4.7 milestone Aug 1, 2019
bogdan-iancu added a commit that referenced this issue Aug 1, 2019
The fix consist of counting only ONCE each transaction (for the xxx_transaction statistics), disregarding its retransmissions, multiple 200OK replies or final code override (like 200 after a negative reply).

Closes  #1636
Many thanks to @bcnewlin for his testing and validation of the fix.

(cherry picked from commit 68bedfe)
bogdan-iancu added a commit that referenced this issue Aug 1, 2019
The fix consist of counting only ONCE each transaction (for the xxx_transaction statistics), disregarding its retransmissions, multiple 200OK replies or final code override (like 200 after a negative reply).

Closes  #1636
Many thanks to @bcnewlin for his testing and validation of the fix.

(cherry picked from commit 68bedfe)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants