Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Dialog callbacks corrupting route stack and AVPs #2681

Closed
bcnewlin opened this issue Nov 10, 2021 · 22 comments
Closed

[BUG] Dialog callbacks corrupting route stack and AVPs #2681

bcnewlin opened this issue Nov 10, 2021 · 22 comments

Comments

@bcnewlin
Copy link

OpenSIPS version you are running

version: opensips 3.2.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: f953b30a0
main.c compiled on 17:50:25 Nov  2 2021 with gcc 7

Describe the bug

Invoking the registered dlg_on_hangup callback is corrupting the route stack trace and dropping some AVP variables.

To Reproduce

  1. Create a config using dlg_on_hangup callback.
  2. In the main route, set some AVPs
  3. Place a call
  4. After the dialog match for the BYE, print the route stack and the AVPs
  5. Route stack will be truncated, only starting at the route that invoked the dialog match
  6. AVPs will no longer be available

Expected behavior

The route stack and AVPs should not be affected by the registered dialog callbacks

Relevant System Logs

OS/environment information

  • Operating System:
  • OpenSIPS installation:
  • other relevant information:

Additional context

@bcnewlin
Copy link
Author

Here is a paste of OpenSIPS logs showing the issue: https://pastebin.com/nnarxseg

It appears the dlg_on_hangup callback may be invoked twice and on the second invocation the test AVP I am printing has been cleared.

@liviuchircu liviuchircu self-assigned this Nov 11, 2021
@liviuchircu liviuchircu added this to the 3.2.4 milestone Nov 11, 2021
@liviuchircu
Copy link
Member

Thank you for the nice report, @bcnewlin! It is very possible that the $route support (which was added after the new dialog routes), is incomplete with regards to the dialog-related invocations, and fails to properly reset the stack state after the route call, or a similar issue.

I will attempt to reproduce the bug(s) based on your instructions, then come up with a fix.

@bcnewlin
Copy link
Author

bcnewlin commented Nov 11, 2021

@liviuchircu Thank you for the quick response. To be honest, I am less concerned with the route stack issue and more concerned with the disappearance of my AVPs after the callback.
It appears in 3.x all event_route execution is now async only (I did not see this documented in any change sets btw), so we are forced to switch to these callbacks from the dialog state change events, as we need synchronous execution to prevent the dialog and its data being gone before execution.

@bogdan-iancu
Copy link
Member

Hi @bcnewlin , the dialog routes, by design, do not intend to share the AVP set cross other dialog events. According to the definition here, AVPs are attached to msg or transaction context, not to dialog context. So, what you experience is correct.
Maybe using $dlg_val()'s instead ?

@bcnewlin
Copy link
Author

bcnewlin commented Nov 11, 2021

@bogdan-iancu I am not trying to access the AVPs from within the dialog route, except to print it in the trace I provided for clarity. The issue is that AVPs that are set in message processing before the dialog route is called are not all available after. I could be wrong, but I don't think it should be expected that using the dialog routes would cause AVPs to be removed from the transaction that is triggering them?
You can see in the trace I provided that even after script execution has returned to the request processing route the AVP continues to be null.
For the purposes of this example I removed all other processing from the dialog route except for an xlog, so there is no possibility that we are somehow accidentally clearing the AVP ourselves.

@bogdan-iancu
Copy link
Member

So, what you are saying that the while being in the request route, for the BYE request, you have one set of AVPs before the on_hangup being internally ran (while the internal dialog matching was done via loose_route/dialog_match/th_match) and after the route is terminated ?

@bcnewlin
Copy link
Author

bcnewlin commented Nov 11, 2021

Well not necessarily a different set, but at least some of the AVPs are not present after the on_hangup route with the values they had before the on_hangup route.
If you look at the trace I provided, there are xlog prints showing both the route stack ($(route[*])) and the value of an AVP we have set in the main request route ($avp(trans_event_source_port) := $sp;). Prior to the dialog callback the avp value prints correctly, but after the dialog callback, when the route stack resumes in handle_sequential, the AVP is now empty.

@bcnewlin
Copy link
Author

I tested setting a new AVP right before the call to th_match and it did retain its value after the script resumed, which is why I said only some AVPs are disappearing. I don't know what the difference is between the two. The trans_event_source_port AVP that is disappearing is one of the first ones we set when receiving any request, perhaps that is related? Maybe there is some memory corruption or something at the start of the AVP storage in the context? Perhaps related to the use of the dummy message?

@bcnewlin
Copy link
Author

As an update, I've been able to work around this by switching back to events and using the new load_dialog_ctx function. It appears that function will load dialogs even after they are ended, unlike get_dialog_vals which we were using before, so the async nature of the event is not a problem. In fact, it's nice to be able to have this execute async now.

I do still think this issue warrants attention and I'm happy to reproduce or provide any more information needed.

@bogdan-iancu
Copy link
Member

@bcnewlin , in regards to the issue with the AVPs, could you try the following patch:

diff --git a/modules/dialog/dlg_handlers.c b/modules/dialog/dlg_handlers.c
index 95f3b855c..9e9749fa0 100644
--- a/modules/dialog/dlg_handlers.c
+++ b/modules/dialog/dlg_handlers.c
@@ -122,6 +122,8 @@ int run_dlg_script_route(struct dlg_cell *dlg, int rt_idx)
        static struct sip_msg* fake_msg= NULL;
        context_p old_ctx, *new_ctx;
        int old_route_type;
+       struct usr_avp **old_avps;
+       struct usr_avp *local_avps;
 
        /************* pre-run sequance ****************/
 
@@ -130,6 +132,8 @@ int run_dlg_script_route(struct dlg_cell *dlg, int rt_idx)
                return -1;
        }
 
+       old_avps = set_avp_list( &local_avps );
+
        swap_route_type(old_route_type, REQUEST_ROUTE);
 
        /************* actual run sequance ****************/
@@ -147,8 +151,9 @@ int run_dlg_script_route(struct dlg_cell *dlg, int rt_idx)
                context_destroy(CONTEXT_GLOBAL, *new_ctx);
        current_processing_ctx = old_ctx;
 
-       /* remove all added AVP - here we use all the time the default AVP list */
-       reset_avps( );
+       /* remove all added AVP and restore the original list */
+       set_avp_list( old_avps );
+       destroy_avp_list( &local_avps );
 
        return 0;
 }

@bcnewlin
Copy link
Author

bcnewlin commented Nov 12, 2021

@bogdan-iancu Unfortunately, using this patch results in a segfault.

version: opensips 3.2.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: f953b30a0
main.c compiled on 19:45:45 Nov 12 2021 with gcc 7

Backtrace: https://pastebin.com/X2AvmLtD
Logs: https://pastebin.com/3hK6sPv4

@bogdan-iancu
Copy link
Member

ups, a missing, but important init of a variable, this this new patch:

diff --git a/modules/dialog/dlg_handlers.c b/modules/dialog/dlg_handlers.c
index 95f3b855c..8a8357d27 100644
--- a/modules/dialog/dlg_handlers.c
+++ b/modules/dialog/dlg_handlers.c
@@ -122,6 +122,8 @@ int run_dlg_script_route(struct dlg_cell *dlg, int rt_idx)
        static struct sip_msg* fake_msg= NULL;
        context_p old_ctx, *new_ctx;
        int old_route_type;
+       struct usr_avp **old_avps;
+       struct usr_avp *local_avps = NULL;
 
        /************* pre-run sequance ****************/
 
@@ -130,6 +132,8 @@ int run_dlg_script_route(struct dlg_cell *dlg, int rt_idx)
                return -1;
        }
 
+       old_avps = set_avp_list( &local_avps );
+
        swap_route_type(old_route_type, REQUEST_ROUTE);
 
        /************* actual run sequance ****************/
@@ -147,8 +151,9 @@ int run_dlg_script_route(struct dlg_cell *dlg, int rt_idx)
                context_destroy(CONTEXT_GLOBAL, *new_ctx);
        current_processing_ctx = old_ctx;
 
-       /* remove all added AVP - here we use all the time the default AVP list */
-       reset_avps( );
+       /* remove all added AVP and restore the original list */
+       set_avp_list( old_avps );
+       destroy_avp_list( &local_avps );
 
        return 0;
 }

@bcnewlin
Copy link
Author

I have verified that with this patch applied:

  • AVPs set outside and prior to the dialog route execution are not accessible within the dialog route
  • AVPs set outside and prior to the dialog route execution are available after execution returns to the main script route
  • AVPs set within the dialog route are not accessible after execution returns to the main script route

I believe these all align with the intended design.

Logs: https://pastebin.com/650Zz9zk

@bogdan-iancu
Copy link
Member

@bcnewlin , thanks for confirming on that. Is anything left here, or are we good to close ?

PS: the fix is here 7193c5d

@bcnewlin
Copy link
Author

bcnewlin commented Nov 15, 2021

@bogdan-iancu The route stack is still not correct after returning from the dialog callback. It would be a nice enhancement to fix that, but I will defer to you on whether you would like to keep this ticket open to track that or not. It is not of high importance to us at this time.

@bogdan-iancu
Copy link
Member

@bcnewlin , you mean the $route part, right ?

@bcnewlin
Copy link
Author

Yes, the route stack after script execution resumes from the dialog route does not include any routes called prior to the dialog route.

@github-actions
Copy link

github-actions bot commented Dec 2, 2021

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@liviuchircu
Copy link
Member

@bcnewlin I was able to reproduce the route stack bug, thank you for the info! A fix should be available soon.

@liviuchircu
Copy link
Member

liviuchircu commented Dec 21, 2021

@bcnewlin please don't test this patch right away, as I just realized I forgot to turn on enable_asserts in my opensips.cfg, so a suite of tests was completely skipped, which I thought were fully passing. There are still some small fixes to be done here :)

liviuchircu added a commit that referenced this issue Dec 21, 2021
The idea of "callback routes" was missed during the development of
$route, with this feature having been first introduced in 3.1.

Conceptually, a "callback route" is a contextless route that may be
invoked in a nested fashion, during the processing of a SIP message in
any of the routes.

For example, on a SIP BYE:
    1. route
	1.1 loose_route() (this triggers the "callback route",
	    e.g. see dlg_on_hangup())
	1.2 route "dlg_on_hangup" is run, contextless!
    2. script processing continues! <--- here, $route output was broken

This patch fixes the issue, by adding recursion detection into the
run_top_route() function, such that the $route variable works as
expected regardless of how many (nested) times it is called.

Many thanks to Ben Newlin for catching this issue, as well as providing
a detailed report!

Fixes #2681

(cherry picked from commit bb1fcc2)
@liviuchircu
Copy link
Member

OK, now it's fully fixed and backported to 3.2. Enjoy! :)

@bcnewlin
Copy link
Author

@liviuchircu Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants