Skip to content

Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*()#202

Open
leonardvimond wants to merge 2 commits intoDOCGroup:masterfrom
leonardvimond:master
Open

Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*()#202
leonardvimond wants to merge 2 commits intoDOCGroup:masterfrom
leonardvimond:master

Conversation

@leonardvimond
Copy link
Copy Markdown

@leonardvimond leonardvimond commented Feb 25, 2016

************ OVERVIEW ************
When a LocationForward occurred and that TAO is retrieving a profile to which it can get connected, then all new outgoing requests are blocked by a mutex in TAO_FT_Invocation_Endpoint_Selector::select_primary or in TAO_FT_Invocation_Endpoint_Selector::select_secondary, as long as the request in progress has not found any profile.
It looks like that each request, once it got the mutex, will try to connect to each profile of the IOGR at the moment it arrived, and will not necessarily use the IOGR updated by the first request. If some profiles are unreachable, then the attempts of connection can be long, and consequently all pending requests will be delayed.
If one configure a Relative RoundTrip Timeout, he will possibly get TIMEOUT to these requests while there would be enough time to get a reply from the new primary.

************ ISSUE ************
I have a use case with a FT client sending many requests to a FT replicated server, and the FT primary server is unplugged from the network.
We expect all requests to be forwarded to the new primary once the switch is over, but many requests get TIMEOUT instead.

For a disconnection of 10.100.14.96 at 16:50:01Z and a RTTT=20s (/var/log/messages-20160214:Feb 12 16:50:01 systint85 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Down), the failure of TCP connection is detected after 6s (as expected, thanks to the TCP Keep Alive we have configured):
#16:50:08.107683

TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <13602> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
...
#16:50:23.041159

TAO_FT (20595|140665202775808) - Got a primary component

And then some attempts of reconnection fail after 3s, accordingly to the TCP parameter tcp_retries2=3.
#16:50:23.042602

TAO (20595|140665202775808) - IIOP_Connector::begin_connection, to 10.100.14.96:11063 which should block
...
#16:50:26.481488

TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <0> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error

A very long time (15s here) is spent between the failure and the first attempt of reconnection, which looks to be only explained by the time needed to gain the mutex in FTSelector.
All requests will pay the cost of 3s when attempting the unreachable profile, and last ones will finish with a TIMEOUT.

************ FIX ************
Making a copy of profiles and release immediately the Mutex enables to all requests to be processed at the same time, they will all try to find the right profile concurrently.
That fix has been validated on the old TAO-V161, however the relative code looks to have been very stable since then and it may work the same in latest releases.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced system stability by addressing potential concurrency issues and refining error handling.
  • Refactor
    • Streamlined the profile selection process to improve overall robustness and resilience.

OVERVIEW
When a LocationForward occurred and that TAO is retrieving a profile to which it can get connected, then all new outgoing requests are blocked by a mutex in TAO_FT_Invocation_Endpoint_Selector::select_primary or in TAO_FT_Invocation_Endpoint_Selector::select_secondary, as long as the request in progress has not found any profile.
It looks like that each request, once it got the mutex, will try to connect to each profile of the IOGR at the moment it arrived, and will not necessarily use the IOGR updated by the first request. If some profiles are unreachable, then the attempts of connection can be long, and consequently all pending requests will be delayed. 
If one configure a Relative RoundTrip Timeout, he will possibly get TIMEOUT to these requests while there would be enough time to get a reply from the new primary.

ISSUE
I have a use case with a FT client sending many requests to a FT (replicated) server, and the FT primary
Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*()
@jwillemsen
Copy link
Copy Markdown
Member

Can you add or extend an automated unit test for this?

@jwillemsen jwillemsen added the needs review Needs to be reviewed label May 22, 2020
@jwillemsen
Copy link
Copy Markdown
Member

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 5, 2025

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 5, 2025

Walkthrough

The changes improve the thread safety and clarity of endpoint selection within the FT_Invocation_Endpoint_Selectors class. Both the select_primary and select_secondary methods now use a local auxiliary TAO_MProfile combined with a scoped lock when accessing profile lists. This adjustment ensures safer concurrent access to shared resources by conditionally handling forward_profiles and base_profiles. Additionally, the exception handling logic has been modified to defer throwing the TRANSIENT exception until later in the invocation process. There is no change to the public interfaces.

Changes

File Change Summary
TAO/.../FaultTolerance/FT_Invocation_Endpoint_Selectors.cpp Updated select_primary and select_secondary methods to use a local auxiliary TAO_MProfile with a scoped lock for thread safety; deferred throwing of TRANSIENT exception

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Selector as FT_Invocation_Endpoint_Selector
    participant Lock as Scoped Lock
    participant Profile as TAO_MProfile_aux

    Client->>Selector: Call select_primary/select_secondary()
    Selector->>Lock: Acquire lock
    Selector->>Selector: Check if forward_profiles exists
    alt forward_profiles available
        Selector->>Profile: Copy forward_profiles into local auxiliary
        Selector->>Lock: Release lock
        Selector->>Client: Return pointer to local profile list
    else No forward_profiles
        Selector->>Lock: Release lock
        Selector->>Client: Return pointer to base_profiles list
    end
    Note right of Selector: Exception deferred for later handling
Loading

Poem

I'm a clever rabbit, hopping in delight,
Safeguarding profiles by day and by night.
With locks and aux tricks, my code's set to soar,
Exceptions wait patiently, no more a bore!
Through tangled threads, I happily roam,
Celebrating smooth logic, in my coder's home!
🐇✨

✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
TAO/orbsvcs/orbsvcs/FaultTolerance/FT_Invocation_Endpoint_Selectors.cpp (2)

52-77: LGTM! Thread-safe profile handling with reduced lock scope.

The implementation successfully addresses the blocking issue by:

  1. Minimizing the mutex lock scope to just profile retrieval
  2. Safely copying forwarded profiles to a local auxiliary object

Consider adding thread safety documentation.

While the implementation is correct, it would be helpful to add a comment explaining that the profile copy allows safe access after releasing the lock.

Add this comment before line 74:

+        // Copy the forward_profiles to allow safe access after releasing the lock
         prof_list_aux.set(*forward_prof_list);

117-142: Consider refactoring duplicated profile retrieval logic.

The profile retrieval logic is identical in both select_primary and select_secondary methods. Consider extracting this into a private helper method to improve maintainability and reduce duplication.

Here's a suggested refactor:

+  private:
+    bool get_profile_list(TAO::Profile_Transport_Resolver *r,
+                         TAO_MProfile& prof_list_aux,
+                         TAO_MProfile*& prof_list) {
+      // Set lock, as forward_profiles might be deleted concurrently.
+      ACE_MT (ACE_GUARD_RETURN (TAO_SYNCH_MUTEX,
+                                guard,
+                                const_cast <TAO_SYNCH_MUTEX &> (r->stub ()->profile_lock ()),
+                                false));
+
+      // Grab the forwarded list
+      TAO_MProfile *forward_prof_list =
+        const_cast<TAO_MProfile *> (r->stub ()->forward_profiles ());
+
+      if (forward_prof_list == 0) {
+        TAO_MProfile &basep = r->stub ()->base_profiles ();
+        prof_list = &basep;
+      } else {
+        // Copy the forward_profiles to allow safe access after releasing the lock
+        prof_list_aux.set(*forward_prof_list);
+        prof_list = &prof_list_aux;
+      }
+      return true;
+    }

Then in both select_primary and select_secondary:

   TAO_MProfile *prof_list;
   TAO_MProfile prof_list_aux;
-  // Retrieve the list of profiles to be used.
-  // Set lock, as forward_profiles might be deleted concurrently.
-  {
-    ACE_MT (ACE_GUARD_RETURN (TAO_SYNCH_MUTEX,
-                              guard,
-                              const_cast <TAO_SYNCH_MUTEX &> (r->stub ()->profile_lock ()),
-                              false));
-    // Grab the forwarded list
-    TAO_MProfile *forward_prof_list =
-      const_cast<TAO_MProfile *> (r->stub ()->forward_profiles ());
-
-    if (forward_prof_list == 0) {
-      TAO_MProfile &basep = r->stub ()->base_profiles ();
-      prof_list = &basep;
-    } else {
-      prof_list_aux.set(*forward_prof_list);
-      prof_list = &prof_list_aux;
-    }
-  }
+  if (!get_profile_list(r, prof_list_aux, prof_list))
+    return false;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fdd3f9a and a7f0d24.

📒 Files selected for processing (1)
  • TAO/orbsvcs/orbsvcs/FaultTolerance/FT_Invocation_Endpoint_Selectors.cpp (2 hunks)
🔇 Additional comments (1)
TAO/orbsvcs/orbsvcs/FaultTolerance/FT_Invocation_Endpoint_Selectors.cpp (1)

36-42: LGTM! Clear explanation of exception handling change.

The updated comment effectively explains why the TRANSIENT exception is deferred to allow request interception points to potentially handle the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs review Needs to be reviewed

Development

Successfully merging this pull request may close these issues.

2 participants