fix: turn atlas-connect-cluster async #343
Conversation
Pull Request Test Coverage Report for Build 16194572196

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls

Pull Request Test Coverage Report for Build 16123662219

💛 - Coveralls
Pull Request Overview

This PR refactors the atlas-connect-cluster tool into an asynchronous background process and updates integration tests to poll until the cluster connection is established.

- Tests now loop with retries to wait for the async connection to succeed.
- ConnectClusterTool has been split into query, prepare, and background connect phases, returning immediately with an "Attempting…" message.
- Added new log IDs for connect attempts and successes in logger.ts.
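The query / prepare / background-connect split described in the overview can be sketched roughly as follows. This is a hypothetical illustration, not the actual ConnectClusterTool implementation; the class, method names, and connection-string stub are all invented for clarity.

```typescript
// Hypothetical sketch of the query / prepare / background-connect split;
// names are illustrative, not the actual ConnectClusterTool implementation.
type ClusterState = "connected" | "connecting" | "disconnected";

class ConnectClusterSketch {
  private state: ClusterState = "disconnected";

  // Phase 1: query the current connection state.
  queryState(): ClusterState {
    return this.state;
  }

  // Phase 2: prepare the connection string (stubbed here).
  prepare(clusterName: string): string {
    return `mongodb+srv://${clusterName}.example.net`;
  }

  // Phase 3: kick off the connect in the background and return immediately,
  // mirroring the "Attempting..." response the PR adds.
  connect(clusterName: string): string {
    const connectionString = this.prepare(clusterName);
    this.state = "connecting";
    void this.connectInBackground(connectionString); // fire-and-forget
    return `Attempting to connect to cluster "${clusterName}"...`;
  }

  private async connectInBackground(_connectionString: string): Promise<void> {
    // Stand-in for the real retry loop that pings until the cluster is reachable.
    await new Promise((resolve) => setTimeout(resolve, 10));
    this.state = "connected";
  }
}
```

The key property is that `connect` returns a message before the background promise settles, so the tool call itself never blocks on cluster readiness.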
Comments suppressed due to low confidence (1)

tests/integration/tools/atlas/clusters.test.ts:195

Add an assertion after the loop (or inside the success branch) to fail the test if the cluster never reports "Cluster is already connected."; otherwise the test may silently pass without verifying a successful connection.

for (let i = 0; i < 600; i++) {
await this.session.serviceProvider.runCommand("admin", {
    ping: 1,
});
return "connected";
Wrap the runCommand('admin', { ping: 1 }) call in a try/catch so transient ping errors don't bubble up and trigger a full reconnection flow prematurely.

Suggested change:

try {
    await this.session.serviceProvider.runCommand("admin", {
        ping: 1,
    });
    return "connected";
} catch (error) {
    logger.warn(LogId.ConnectionPingError, `Ping command failed: ${error.message}`);
    return "connecting";
}
no need, I'm bubbling up the error
for (let i = 0; i < 600; i++) {
    // try for 5 minutes
Extract the retry count (600) and delay (500 ms) into named constants to improve readability and ease future adjustments.

Suggested change:

for (let i = 0; i < RETRY_COUNT; i++) {
    // try for RETRY_COUNT attempts
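A fuller version of the suggested refactor might look like the sketch below. The constant names and the pollUntil helper are hypothetical, chosen only to show how the magic numbers could be hoisted out of the loop.

```typescript
// Illustrative sketch of the suggested refactor: retry count and delay pulled
// into named constants. CONNECT_RETRY_COUNT, CONNECT_RETRY_DELAY_MS and
// pollUntil are hypothetical names, not the project's actual identifiers.
const CONNECT_RETRY_COUNT = 600;
const CONNECT_RETRY_DELAY_MS = 500; // 600 attempts * 500 ms ≈ 5 minutes total

async function pollUntil(check: () => Promise<boolean>): Promise<boolean> {
  for (let i = 0; i < CONNECT_RETRY_COUNT; i++) {
    if (await check()) {
      return true; // condition met before exhausting the retries
    }
    await new Promise((resolve) => setTimeout(resolve, CONNECT_RETRY_DELAY_MS));
  }
  return false; // retries exhausted
}
```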
src/logger.ts (Outdated)
@@ -17,6 +17,8 @@ export const LogId = {
     atlasDeleteDatabaseUserFailure: mongoLogId(1_001_002),
     atlasConnectFailure: mongoLogId(1_001_003),
     atlasInspectFailure: mongoLogId(1_001_004),
+    atlasConnectAttempt: mongoLogId(1_001_005),
+    atlasConnectSuccessed: mongoLogId(1_001_006),
The log ID key atlasConnectSuccessed is misspelled; consider renaming it to atlasConnectSucceeded or atlasConnectSuccess for clarity.

Suggested change:

atlasConnectSucceeded: mongoLogId(1_001_006),
const connectionString = cn.toString();

return connectionString;
Suggested change:

return cn.toString();
groupId: this.session.connectedAtlasCluster?.projectId || "",
username: this.session.connectedAtlasCluster?.username || "",
If those are not set, does it make sense to make that call at all?
@nirinchev this is ready for another look
Looks good - let's add some clarifying comments (feel free to use my suggestion or reword them) and fix the retry attempts for the "connects to cluster" test.
}

protected async execute({ projectId, clusterName }: ToolArgs<typeof this.argsShape>): Promise<CallToolResult> {
    const connectingResult = {
Any reason to have this at the top of the function if it's only used at the last catch clause?
yes, it is used in two cases: the last return, and a switch case early on (case "connecting")
const connectionString = await this.prepareClusterConnection(projectId, clusterName);

try {
    await this.connectToCluster(connectionString, 60);
Let's add some comments that could guide readers - e.g.:

Suggested change:

// First, try to connect to the cluster within the current tool call.
// We give it 60 attempts with 500 ms delay between each, so ~30 seconds
await this.connectToCluster(connectionString, 60);
done
`error connecting to cluster: ${error.message}`
);

process.nextTick(async () => {
Suggested change:

// We couldn't connect in ~30 seconds, likely because user creation is taking longer.
// Retry the connection with a longer timeout (~5 minutes), while also returning a response
// to the client. Many clients will have a 1 minute timeout for tool calls, so we want to
// return well before that.
//
// Once we add support for streamable http, we'd want to use progress notifications here.
process.nextTick(async () => {
done
expect(response.content).toBeArray();
expect(response.content).toHaveLength(1);
expect(response.content[0]?.text).toContain(`Connected to cluster "${clusterName}"`);
for (let i = 0; i < 600; i++) {
Now that we added the 30 second retry logic in the connect tool, this may be a bit excessive - worst case scenario, this will result in 5 hours of waiting for the test to fail.
I'd expect Jest has some default test timeouts. Maybe we can set an explicit timeout here and turn this into a while loop, or create a waitFor / retry helper.
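A minimal helper along the lines suggested could poll a condition with an explicit overall timeout instead of a fixed iteration count. The name waitFor and the option defaults below are illustrative, not an existing project API.

```typescript
// Minimal waitFor helper: poll a condition until it holds or a deadline passes.
// Name and defaults are illustrative, not an existing project API.
async function waitFor(
  condition: () => Promise<boolean>,
  { timeoutMs = 30_000, intervalMs = 500 }: { timeoutMs?: number; intervalMs?: number } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) {
      return; // condition met before the deadline
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`condition not met within ${timeoutMs} ms`);
}
```

A test could then await waitFor(...) within Jest's own test timeout instead of hand-rolling a 600-iteration for loop.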
This is actually not the case: on subsequent calls we don't try for 30 secs; we know there is a background process running, so we return the Attempting ... message straight away.
changed to reflect what we discussed offline: now we always wait 30 secs, and I adjusted the test to 10 attempts only
`error connecting to cluster: ${error.message}`
);

process.nextTick(async () => {
What is the benefit of using process.nextTick here? There's no problematic operation I can think of that would take precedence and justify its usage here.
This is my bad: when I used to work with Node we didn't have/use promises (10 years ago), so process.nextTick was used for async operations. Nowadays void somePromise is much handier, and I keep forgetting.
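The two fire-and-forget styles being compared can be sketched as below. backgroundConnect here is an illustrative stand-in for the long connect loop, not the PR's actual function; the important point is that since nothing awaits the discarded promise, rejections must be caught inside the async function itself.

```typescript
// Sketch contrasting the two fire-and-forget styles discussed in this thread.
// backgroundConnect is an illustrative stand-in for the long connect loop.
let finished = false;

async function backgroundConnect(): Promise<void> {
  try {
    await new Promise((resolve) => setTimeout(resolve, 10)); // stand-in for retries
    finished = true;
  } catch (error) {
    // Nothing awaits the discarded promise, so handle rejections here to
    // avoid an unhandled promise rejection.
    console.error("background connect failed:", error);
  }
}

// Older style: defer via the event loop.
// process.nextTick(() => void backgroundConnect());

// Modern style: start the promise and explicitly discard it.
void backgroundConnect();
```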
addressed
case "unknown":
default:
    await this.session.disconnect();
    const connectionString = await this.prepareClusterConnection(projectId, clusterName);
We should fix this, though it seems like we can move it entirely into connectToCluster, can't we?
Actually, we can't - we need to await this before we continue because it's setting the connected cluster on the session. Guess we just need to wrap it in curly braces.
fixed the tests and syntax
fixes #321
Proposed changes

Turn the atlas-connect-cluster tool into an async tool. Unfortunately, we rely on the Atlas user creation API to connect; sometimes propagation of the user from the control plane to the data plane can take time, and there is no state to query other than checking whether the user has access to the DB.

Checklist