Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*()#202
leonardvimond wants to merge 2 commits into DOCGroup:master
Can you add or extend an automated unit test for this?
@coderabbitai full review
✅ Actions performed: Full review triggered.
Walkthrough
The changes improve the thread safety and clarity of endpoint selection within the FT_Invocation_Endpoint_Selectors class.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant Selector as FT_Invocation_Endpoint_Selector
participant Lock as Scoped Lock
participant Profile as TAO_MProfile_aux
Client->>Selector: Call select_primary/select_secondary()
Selector->>Lock: Acquire lock
Selector->>Selector: Check if forward_profiles exists
alt forward_profiles available
Selector->>Profile: Copy forward_profiles into local auxiliary
Selector->>Lock: Release lock
Selector->>Client: Return pointer to local profile list
else No forward_profiles
Selector->>Lock: Release lock
Selector->>Client: Return pointer to base_profiles list
end
Note right of Selector: Exception deferred for later handling
Actionable comments posted: 0
🧹 Nitpick comments (2)
TAO/orbsvcs/orbsvcs/FaultTolerance/FT_Invocation_Endpoint_Selectors.cpp (2)
52-77: LGTM! Thread-safe profile handling with reduced lock scope.
The implementation successfully addresses the blocking issue by:
- Minimizing the mutex lock scope to just profile retrieval
- Safely copying forwarded profiles to a local auxiliary object
Consider adding thread safety documentation.
While the implementation is correct, it would be helpful to add a comment explaining that the profile copy allows safe access after releasing the lock.
Add this comment before line 74:
+ // Copy the forward_profiles to allow safe access after releasing the lock
  prof_list_aux.set(*forward_prof_list);
117-142: Consider refactoring duplicated profile retrieval logic.
The profile retrieval logic is identical in both select_primary and select_secondary methods. Consider extracting this into a private helper method to improve maintainability and reduce duplication.
Here's a suggested refactor:
+ private:
+   bool get_profile_list(TAO::Profile_Transport_Resolver *r,
+                         TAO_MProfile& prof_list_aux,
+                         TAO_MProfile*& prof_list) {
+     // Set lock, as forward_profiles might be deleted concurrently.
+     ACE_MT (ACE_GUARD_RETURN (TAO_SYNCH_MUTEX,
+                               guard,
+                               const_cast <TAO_SYNCH_MUTEX &> (r->stub ()->profile_lock ()),
+                               false));
+
+     // Grab the forwarded list
+     TAO_MProfile *forward_prof_list =
+       const_cast<TAO_MProfile *> (r->stub ()->forward_profiles ());
+
+     if (forward_prof_list == 0) {
+       TAO_MProfile &basep = r->stub ()->base_profiles ();
+       prof_list = &basep;
+     } else {
+       // Copy the forward_profiles to allow safe access after releasing the lock
+       prof_list_aux.set(*forward_prof_list);
+       prof_list = &prof_list_aux;
+     }
+     return true;
+   }
Then in both select_primary and select_secondary:
  TAO_MProfile *prof_list;
  TAO_MProfile prof_list_aux;
- // Retrieve the list of profiles to be used.
- // Set lock, as forward_profiles might be deleted concurrently.
- {
-   ACE_MT (ACE_GUARD_RETURN (TAO_SYNCH_MUTEX,
-                             guard,
-                             const_cast <TAO_SYNCH_MUTEX &> (r->stub ()->profile_lock ()),
-                             false));
-   // Grab the forwarded list
-   TAO_MProfile *forward_prof_list =
-     const_cast<TAO_MProfile *> (r->stub ()->forward_profiles ());
-
-   if (forward_prof_list == 0) {
-     TAO_MProfile &basep = r->stub ()->base_profiles ();
-     prof_list = &basep;
-   } else {
-     prof_list_aux.set(*forward_prof_list);
-     prof_list = &prof_list_aux;
-   }
- }
+ if (!get_profile_list(r, prof_list_aux, prof_list))
+   return false;
📜 Review details
📒 Files selected for processing (1)
TAO/orbsvcs/orbsvcs/FaultTolerance/FT_Invocation_Endpoint_Selectors.cpp(2 hunks)
🔇 Additional comments (1)
TAO/orbsvcs/orbsvcs/FaultTolerance/FT_Invocation_Endpoint_Selectors.cpp (1)
36-42: LGTM! Clear explanation of exception handling change.
The updated comment effectively explains why the TRANSIENT exception is deferred to allow request interception points to potentially handle the issue.
************ OVERVIEW ************
When a LocationForward has occurred and TAO is looking for a profile it can connect to, all new outgoing requests are blocked by a mutex in TAO_FT_Invocation_Endpoint_Selector::select_primary or in TAO_FT_Invocation_Endpoint_Selector::select_secondary until the request in progress has found a profile.
It appears that each request, once it has acquired the mutex, tries to connect to each profile of the IOGR as it was when the request arrived, and does not necessarily use the IOGR updated by the first request. If some profiles are unreachable, the connection attempts can take a long time, and consequently all pending requests are delayed.
If a Relative RoundTrip Timeout is configured, these requests may fail with TIMEOUT even though there would have been enough time to get a reply from the new primary.
************ ISSUE ************
I have a use case with an FT client sending many requests to an FT replicated server, and the FT primary server is unplugged from the network.
We expect all requests to be forwarded to the new primary once the switch is over, but many requests get TIMEOUT instead.
For a disconnection of 10.100.14.96 at 16:50:01Z with RTTT=20s (/var/log/messages-20160214:Feb 12 16:50:01 systint85 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Down), the TCP connection failure is detected after 6s (as expected, thanks to the TCP Keep Alive we have configured):
#16:50:08.107683
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <13602> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
...
#16:50:23.041159
TAO_FT (20595|140665202775808) - Got a primary component
Then some reconnection attempts fail after 3s, in accordance with the TCP parameter tcp_retries2=3.
#16:50:23.042602
TAO (20595|140665202775808) - IIOP_Connector::begin_connection, to 10.100.14.96:11063 which should block
...
#16:50:26.481488
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <0> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
A very long time (15s here) elapses between the failure and the first reconnection attempt, which seems to be explained only by the time needed to acquire the mutex in FTSelector.
Every request pays the 3s cost of attempting the unreachable profile, and the last ones end with a TIMEOUT.
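To make the arithmetic concrete, here is a deliberately simplified model, not taken from TAO: it assumes each queued request spends the full attempt time on the unreachable profile, and that attempts either run strictly one after another (mutex held during the attempt) or fully overlap (mutex released early). The function names are hypothetical, and the 6s failure-detection delay is ignored.

```cpp
#include <cassert>

// Illustrative model only (hypothetical helpers, not TAO code).
// Seconds elapsed before the k-th queued request (k = 1, 2, ...) has finished
// its attempt on the unreachable profile when attempts are serialized.
double serialized_delay_s(int k, double attempt_s) {
  assert(k >= 1 && attempt_s >= 0.0);
  return k * attempt_s;  // each request waits for all earlier attempts
}

// Same quantity when the mutex is released before connecting, so that
// all attempts overlap: every request pays the attempt cost exactly once.
double concurrent_delay_s(int /*k*/, double attempt_s) {
  assert(attempt_s >= 0.0);
  return attempt_s;
}

// True if the delay alone already exceeds the round-trip timeout.
bool exceeds_timeout(double delay_s, double timeout_s) {
  return delay_s > timeout_s;
}
```

Under these assumptions, with a 3s attempt and RTTT=20s, the seventh serialized request is already past its timeout before it can try another profile, whereas with overlapping attempts each request loses only 3s.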
************ FIX ************
Making a copy of the profiles and releasing the mutex immediately allows all requests to be processed at the same time: they all try to find the right profile concurrently.
This fix has been validated on the old TAO-V161; however, the relevant code appears to have been very stable since then, so it may work the same way in the latest releases.
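The copy-then-release idea can be sketched outside TAO with standard C++ only. Everything below is a hypothetical stand-in: `Stub`, `ProfileList`, `snapshot_profiles`, and `select_endpoint` are illustrative names, `std::mutex` replaces `TAO_SYNCH_MUTEX`, and, unlike the patch, this sketch copies the base profiles as well, for simplicity.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-ins for TAO_MProfile and TAO_Stub; not the TAO API.
using ProfileList = std::vector<std::string>;

struct Stub {
  std::mutex profile_lock;                  // guards forward_profiles
  ProfileList base_profiles;                // profiles of the original IOGR
  ProfileList *forward_profiles = nullptr;  // may be replaced concurrently
};

// Take a private snapshot of the profiles to try. The lock is held only
// for the duration of the copy and is released when the function returns.
ProfileList snapshot_profiles(Stub &stub) {
  std::lock_guard<std::mutex> guard(stub.profile_lock);
  return stub.forward_profiles ? *stub.forward_profiles : stub.base_profiles;
}

// Try each profile of the snapshot with the lock released, so that slow
// connection attempts do not block other requests. try_connect is a
// caller-supplied callable returning true on success.
template <typename TryConnect>
bool select_endpoint(Stub &stub, TryConnect try_connect) {
  const ProfileList profiles = snapshot_profiles(stub);
  for (const std::string &profile : profiles) {
    if (try_connect(profile)) {
      return true;
    }
  }
  return false;
}
```

The trade-off is one list copy per request in exchange for never holding the lock across a connection attempt; each request then works on the profile list as it was at snapshot time.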