Skip to content

fix: turn atlas-connect-cluster async #343

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jul 10, 2025
Merged

fix: turn atlas-connect-cluster async #343

merged 26 commits into from
Jul 10, 2025

Conversation

fmenezes
Copy link
Collaborator

@fmenezes fmenezes commented Jul 7, 2025

fixes #321

Proposed changes

turn atlas-connect-cluster tool into an async tool. Unfortunately, rely on atlas user creation api to connect, sometimes the propagation time of the user from control plane to the data plane can take time, there is no state to query other than check if user has access to DB.

Checklist

@coveralls
Copy link
Collaborator

coveralls commented Jul 7, 2025

Pull Request Test Coverage Report for Build 16194572196

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 33 of 51 (64.71%) changed or added relevant lines in 3 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.8%) to 74.831%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/session.ts 3 5 60.0%
src/tools/mongodb/mongodbTool.ts 5 8 62.5%
src/tools/atlas/metadata/connectCluster.ts 25 38 65.79%
Files with Coverage Reduction New Missed Lines %
src/tools/atlas/metadata/connectCluster.ts 2 61.74%
Totals Coverage Status
Change from base Build 16135914646: -0.8%
Covered Lines: 863
Relevant Lines: 1065

💛 - Coveralls

@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 16123662219

Details

  • 0 of 34 (0.0%) changed or added relevant lines in 1 file are covered.
  • 129 unchanged lines in 15 files lost coverage.
  • Overall coverage decreased (-13.3%) to 60.879%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/tools/atlas/metadata/connectCluster.ts 0 34 0.0%
Files with Coverage Reduction New Missed Lines %
src/common/atlas/generatePassword.ts 3 25.0%
src/tools/atlas/create/createFreeCluster.ts 3 57.14%
src/tools/atlas/read/inspectCluster.ts 3 23.53%
src/tools/atlas/read/listAlerts.ts 3 20.0%
src/tools/atlas/read/listOrgs.ts 3 72.73%
src/tools/atlas/read/inspectAccessList.ts 4 33.33%
src/tools/atlas/create/createDBUser.ts 6 25.0%
src/tools/atlas/create/createProject.ts 8 46.67%
src/common/atlas/cluster.ts 9 0.0%
src/tools/atlas/create/createAccessList.ts 10 16.13%
Totals Coverage Status
Change from base Build 16077040616: -13.3%
Covered Lines: 704
Relevant Lines: 1046

💛 - Coveralls

@fmenezes fmenezes marked this pull request as ready for review July 8, 2025 06:54
@Copilot Copilot AI review requested due to automatic review settings July 8, 2025 06:54
@fmenezes fmenezes requested a review from a team as a code owner July 8, 2025 06:54
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the atlas-connect-cluster tool into an asynchronous background process and updates integration tests to poll until the cluster connection is established.

  • Tests now loop with retries to wait for the async connection to succeed.
  • ConnectClusterTool has been split into query, prepare, and background connect phases, returning immediately with an “Attempting…” message.
  • Added new log IDs for connect attempts and successes in logger.ts.
Comments suppressed due to low confidence (1)

tests/integration/tools/atlas/clusters.test.ts:195

  • Add an assertion after the loop (or inside the success branch) to fail the test if the cluster never reports "Cluster is already connected.", otherwise the test may silently pass without verifying a successful connection.
                for (let i = 0; i < 600; i++) {

Comment on lines 43 to 46
await this.session.serviceProvider.runCommand("admin", {
ping: 1,
});
return "connected";
Copy link
Preview

Copilot AI Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap the runCommand('admin', { ping: 1 }) call in a try/catch so transient ping errors don’t bubble up and trigger a full reconnection flow prematurely.

Suggested change
await this.session.serviceProvider.runCommand("admin", {
ping: 1,
});
return "connected";
try {
await this.session.serviceProvider.runCommand("admin", {
ping: 1,
});
return "connected";
} catch (error) {
logger.warn(LogId.ConnectionPingError, `Ping command failed: ${error.message}`);
return "connecting";
}

Copilot uses AI. Check for mistakes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need, I'm bubbling up the error

Comment on lines 122 to 123
for (let i = 0; i < 600; i++) {
// try for 5 minutes
Copy link
Preview

Copilot AI Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract the retry count (600) and delay (500ms) into named constants to improve readability and ease future adjustments.

Suggested change
for (let i = 0; i < 600; i++) {
// try for 5 minutes
for (let i = 0; i < RETRY_COUNT; i++) {
// try for RETRY_COUNT attempts

Copilot uses AI. Check for mistakes.

src/logger.ts Outdated
@@ -17,6 +17,8 @@ export const LogId = {
atlasDeleteDatabaseUserFailure: mongoLogId(1_001_002),
atlasConnectFailure: mongoLogId(1_001_003),
atlasInspectFailure: mongoLogId(1_001_004),
atlasConnectAttempt: mongoLogId(1_001_005),
atlasConnectSuccessed: mongoLogId(1_001_006),
Copy link
Preview

Copilot AI Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log ID key atlasConnectSuccessed is misspelled; consider renaming it to atlasConnectSucceeded or atlasConnectSuccess for clarity.

Suggested change
atlasConnectSuccessed: mongoLogId(1_001_006),
atlasConnectSucceeded: mongoLogId(1_001_006),

Copilot uses AI. Check for mistakes.

Comment on lines 108 to 110
const connectionString = cn.toString();

return connectionString;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const connectionString = cn.toString();
return connectionString;
return cn.toString();

Comment on lines 148 to 149
groupId: this.session.connectedAtlasCluster?.projectId || "",
username: this.session.connectedAtlasCluster?.username || "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those are not set, does it make sense to make that call at all?

@fmenezes fmenezes requested a review from nirinchev July 8, 2025 13:44
@fmenezes
Copy link
Collaborator Author

fmenezes commented Jul 8, 2025

@nirinchev this is ready for another look

Copy link
Collaborator

@nirinchev nirinchev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - let's add some clarifying comments (feel free to use my suggestion or reword them) and fix the retry attempts for the "connects to cluster" test.

}

protected async execute({ projectId, clusterName }: ToolArgs<typeof this.argsShape>): Promise<CallToolResult> {
const connectingResult = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to have this at the top of the function if it's only used at the last catch clause?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is used in two cases, the last return and a switch case early on case "connecting"

const connectionString = await this.prepareClusterConnection(projectId, clusterName);

try {
await this.connectToCluster(connectionString, 60);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some comments that could guide readers - e.g.:

Suggested change
await this.connectToCluster(connectionString, 60);
// First, try to connect to the cluster within the current tool call.
// We give it 60 attempts with 500 ms delay between each, so ~30 seconds
await this.connectToCluster(connectionString, 60);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

`error connecting to cluster: ${error.message}`
);

process.nextTick(async () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
process.nextTick(async () => {
// We couldn't connect in ~30 seconds, likely because user creation is taking longer
// Retry the connection with longer timeout (~5 minutes), while also returning a response
// to the client. Many clients will have a 1 minute timeout for tool calls, so we want to
// return well before that.
//
// Once we add support for streamable http, we'd want to use progress notifications here.
process.nextTick(async () => {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

expect(response.content).toBeArray();
expect(response.content).toHaveLength(1);
expect(response.content[0]?.text).toContain(`Connected to cluster "${clusterName}"`);
for (let i = 0; i < 600; i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we added the 30 second retry logic in the connect tool, this may be a bit excessive - worst case scenario, this will result in 5 hours of waiting for the test to fail.

Copy link
Collaborator

@gagik gagik Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect jest has some default test timeouts. maybe we can set an explicit timeout here and turn this into a while loop or create a waitFor / retry helper

Copy link
Collaborator Author

@fmenezes fmenezes Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not the case, on subsequent calls we don't try for 30 secs, we know there is a background process running so we return Attempting ... message straight away.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to reflect what we discussed offline, now we always wait 30 secs, adjusted the test to 10 times only

`error connecting to cluster: ${error.message}`
);

process.nextTick(async () => {
Copy link
Collaborator

@gagik gagik Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the benefit of using process.nextTick here?
There's no problematic operation that'd take precedence I can think of that would justify its usage here.

Copy link
Collaborator Author

@fmenezes fmenezes Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is my bad, when I use to work with node we didn't have/use promises (10 yrs ago) so process.nextTick was used for async operations, now-a-days void somePromise is much handier and I keep forgetting

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

case "unknown":
default:
await this.session.disconnect();
const connectionString = await this.prepareClusterConnection(projectId, clusterName);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fix this, though it seems like we can move it entirely into connectToCluster, can't we?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we can't - we need to await this before we continue because it's setting the connected cluster on the session. Guess we just need to wrap it in curly braces.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the tests and syntax

@fmenezes fmenezes enabled auto-merge (squash) July 10, 2025 12:07
@fmenezes fmenezes merged commit 27c52b4 into main Jul 10, 2025
18 checks passed
@fmenezes fmenezes deleted the mcp-46 branch July 10, 2025 12:10
nirinchev added a commit that referenced this pull request Jul 11, 2025
* main:
  chore: revoke access tokens on server shutdown [MCP-53] (#352)
  fix: turn atlas-connect-cluster async (#343)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

atlas-connect-cluster command randomly fails to connect
4 participants