Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MongoServerSelectionError exception happen periodically when change stream code is added #11275

Closed
newbaluku opened this issue Jan 25, 2022 · 22 comments
Assignees
Labels
docs This issue is due to a mistake or omission in the mongoosejs.com documentation
Milestone

Comments

@newbaluku
Copy link

Do you want to request a feature or report a bug?
bug

What is the current behavior?
when change stream code is included during connection to mongoDB, the AWS Lambda container would crashed periodically with Uncaught exception:

Uncaught Exception {
    "errorType": "MongoServerSelectionError",
    "errorMessage": "connection <monitor> to 52.221.162.36:27017 timed out",
    "reason": {
        "type": "ReplicaSetNoPrimary",
        "servers": {},
        "stale": false,
        "compatible": true,
        "heartbeatFrequencyMS": 10000,
        "localThresholdMS": 15,
        "setName": "atlas-xhdnq3-shard-0",
        "maxSetVersion": 2,
        "maxElectionId": "7fffffff000000000000001c",
        "commonWireVersion": 9
    },
    "stack": [
        "MongoServerSelectionError: connection <monitor> to 52.221.162.36:27017 timed out",
        "    at Timeout._onTimeout (/var/task/node_modules/mongodb/lib/sdam/topology.js:330:38)",
        "    at listOnTimeout (internal/timers.js:554:17)",
        "    at processTimers (internal/timers.js:497:7)"
    ]
}

OR (I believe older version of Mongoose reported this)

Uncaught Exception {
    "errorType": "MongoServerSelectionError",
    "errorMessage": "connection timed out",
    "reason": {
        "type": "ReplicaSetNoPrimary",
        "servers": {},
        "stale": false,
        "compatible": true,
        "heartbeatFrequencyMS": 10000,
        "localThresholdMS": 15,
        "setName": "atlas-xhdnq3-shard-0",
        "maxSetVersion": 2,
        "maxElectionId": "7fffffff000000000000001c",
        "commonWireVersion": 9
    },
    "stack": [
        "MongoServerSelectionError: connection timed out",
        "    at Timeout._onTimeout (/var/task/node_modules/mongodb/lib/sdam/topology.js:325:38)",
        "    at listOnTimeout (internal/timers.js:554:17)",
        "    at processTimers (internal/timers.js:497:7)"
    ]
}

without the change stream code, no exception will happen.

If the current behavior is a bug, please provide the steps to reproduce.
the above can be repro using sample Lambda project @ https://github.com/newbaluku/repro-mongodb-exception
This project has a 4 mins warmer running to keep the container warm and also a 35 mins timer to trigger DB query to simulate endpoint invocation. On cold boot, it will make DB connection as well as open a change stream for a collection.

What is the expected behavior?
exception should not happen with/without the change stream code.

What are the versions of Node.js, Mongoose and MongoDB you are using? Note that "latest" is not a version.
Node.js v12.22.7
Mongoose v6.1.8 (also repro in v6.1.7 and v6.0.12)
MongoDB v4.4.11

@Uzlopak
Copy link
Collaborator

Uzlopak commented Jan 26, 2022

@newbaluku
Copy link
Author

Seems to be related https://jira.mongodb.org/projects/NODE/issues/NODE-3894

I'm not too sure if they are related, because the above mentioned issue seems to happened when there is a change of MongoDB primary node. But that is not happening in my case. It happened even when there is no change of primary on MongoDB cluster.

@newbaluku
Copy link
Author

are you guys able to repro it using my sample?

@Uzlopak
Copy link
Collaborator

Uzlopak commented Jan 27, 2022

I never used lambda functions so I can't test that mvce. I assume that this is a mongodb native driver issue. Mongoose is just adding syntactical sugar over mongodb native driver. Probably the Bugtracker of mongdb is better place for your issue.

@IslandRhythms IslandRhythms added the has repro script There is a repro script, the Mongoose devs need to confirm that it reproduces the issue label Jan 27, 2022
@newbaluku
Copy link
Author

newbaluku commented Jan 28, 2022

perhaps someone on your team can help to repro this issue using my sample. I also think that the issue may be related to driver. But I'm not sure I could just blame the driver for that. Because the driver guy may come back to me and say that I didn't use the driver correctly and demand an explanation. This is the part that I need your expertise as a Mongoose developer to come in and advice. And it may also turn out that I might not be using Mongoose correctly and hence causing the issue.

So, I sincerely hope that the Mongoose team can help to look into this and help with facilitating a fix (in driver/Mongoose) or perhaps advice me on any possible workaround solution.

Thanks a lot.

@vkarpov15 vkarpov15 added this to the 6.1.12 milestone Jan 30, 2022
@IslandRhythms IslandRhythms added needs clarification This issue doesn't have enough information to be actionable. Close after 14 days of inactivity and removed has repro script There is a repro script, the Mongoose devs need to confirm that it reproduces the issue labels Feb 8, 2022
@IslandRhythms
Copy link
Collaborator

What do I type to get the repo going?

@newbaluku
Copy link
Author

What do I type to get the repo going?

repro steps:

you need to install serverless framework to deploy to AWS:
https://www.serverless.com/framework/docs/getting-started

after you get the code from repo,

goto project folder and run "npm install" to install all npm packages.
add your mongodb url at DATABASEURL in the file: helpers/db.js
run "sls deploy" to deploy to AWS Lambda.
you can then look for the exceptions that happens periodically in AWS CloudWatch logs.
Thanks for the repro.

@IslandRhythms
Copy link
Collaborator

Need an AWS account to repro @vkarpov15

@ErnestBrandi
Copy link

Seems related to an issue I also have : #11228

@vkarpov15
Copy link
Collaborator

@newbaluku I've got your sample running on my AWS account and successfully repro-ed your error:

2022-02-19 19:50:10.951	ERROR	Uncaught Exception 	{"errorType":"MongoServerSelectionError","errorMessage":"connection <monitor> to 52.87.119.4:27017 timed out","reason":{"type":"ReplicaSetNoPrimary","servers":{},"stale":false,"compatible":true,"heartbeatFrequencyMS":10000,"localThresholdMS":15,"setName":"Cluster0-shard-0","maxSetVersion":11,"maxElectionId":"7fffffff0000000000000081","commonWireVersion":13},"stack":["MongoServerSelectionError: connection <monitor> to 52.87.119.4:27017 timed out","    at Timeout._onTimeout (/var/task/node_modules/mongodb/lib/sdam/topology.js:330:38)","    at listOnTimeout (internal/timers.js:554:17)","    at processTimers (internal/timers.js:497:7)"]}
[ERROR] [1645318210952] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 400.
END Duration: 11.06 ms	Billed Duration: 12 ms	Memory Size: 1024 MB	Max Memory Used: 40 MB	

I'm adding some debugging to track this down, will post here when I find out more. Progress will likely be a little slow because, as written, this bug only seems to happen once every 35 minutes, so the feedback loop is pretty slow.

In the meantime, try reducing heartbeatFrequencyMS to 2000 like we did in #9179, I suspect this may be a similar issue.

@vkarpov15 vkarpov15 added confirmed-bug We've confirmed this is a bug in Mongoose and will fix it. and removed needs clarification This issue doesn't have enough information to be actionable. Close after 14 days of inactivity labels Feb 20, 2022
@newbaluku
Copy link
Author

Thanks @vkarpov15 for reproducing the bug.

I have just tried setting the heartbeatFrequencyMS: 2000 option during mongoose.connect(). Unfortunately, I can still repro the same bug, consistently and almost immediately when the database is queried.

based on my prelim observations, the frequency of it happening has increased after setting the option.

@newbaluku
Copy link
Author

Some additional info that may help in your debugging.

the bug doesn't seems to be reproducible when I deployed the same original code to run on Alibaba cloud Function compute (Lambda equivalent) that connects to ApsaraDB for MongoDB (MongoDB Atlas equivalent) that uses MongoDB version 4.2

@vkarpov15 vkarpov15 modified the milestones: 6.2.3, 6.2.4 Feb 21, 2022
@vkarpov15 vkarpov15 modified the milestones: 6.2.4, 6.2.5 Feb 28, 2022
@davideicardi
Copy link

davideicardi commented Mar 1, 2022

We have a similar problem after upgrading Mongoose from 5.11.13 to 6.2.1.
We have the same error periodically (but not always) when we run simple queries without using .lean().

When we use .lean() it works correctly.

Do you think it can be related?

@davideicardi
Copy link

davideicardi commented Mar 1, 2022

Some more info on my side: we are running mongoose inside an AWS Lambda and MongoDb Atlas cluster.

Also when using serverSelectionTimeoutMS=5s, as suggested in https://mongoosejs.com/docs/lambda.html, we get the MongoServerSelectionError error. If we leave the default, 30s, it seems to work but sometime we need to wait several seconds to connect. It seems that this happen only when lean() is not used. Still investigating ...

@vkarpov15
Copy link
Collaborator

@davideicardi if you're not using change streams, your issue is unrelated. It is hard to tell without looking at some code samples, but given that you're getting a ServerSelectionError if ServerSelectionTimeoutMS is 5000 but not if you leave it at 30000, I suspect you should be able to fix the "need to wait several seconds to connect" by reducing heartbeatFrequencyMS to say 3000.

@newbaluku what I'm seeing so far is that the change stream is getting confused by the warmer. It looks like the change stream starts a getMore() on one iteration of the warmer, runs server selection on the next iteration, and then reports an error on the 3rd iteration.

I suspect the issue comes down to race conditions with using a change stream as a stream in a Lambda. I don't have an exact answer about the particulars why yet, but change streams don't handle being interrupted mid getMore() very well. I'm running a workaround now, I'll see how it works and post here when I know more.

@newbaluku
Copy link
Author

@davideicardi if you're not using change streams, your issue is unrelated. It is hard to tell without looking at some code samples, but given that you're getting a ServerSelectionError if ServerSelectionTimeoutMS is 5000 but not if you leave it at 30000, I suspect you should be able to fix the "need to wait several seconds to connect" by reducing heartbeatFrequencyMS to say 3000.

@newbaluku what I'm seeing so far is that the change stream is getting confused by the warmer. It looks like the change stream starts a getMore() on one iteration of the warmer, runs server selection on the next iteration, and then reports an error on the 3rd iteration.

I suspect the issue comes down to race conditions with using a change stream as a stream in a Lambda. I don't have an exact answer about the particulars why yet, but change streams don't handle being interrupted mid getMore() very well. I'm running a workaround now, I'll see how it works and post here when I know more.

@vkarpov15 Thanks for the update and your efforts in debugging the issue.

@vkarpov15 vkarpov15 modified the milestones: 6.2.5, 6.2.6 Mar 9, 2022
@vkarpov15 vkarpov15 modified the milestones: 6.2.7, 6.2.8 Mar 11, 2022
@vkarpov15 vkarpov15 modified the milestones: 6.2.8, 6.2.9 Mar 23, 2022
@vkarpov15
Copy link
Collaborator

@newbaluku I think I found the right way to do this. I've been discussing with the MongoDB Node driver team and we both agree that running a change stream as a stream in the background of a Lambda function is bad practice. Too many potential places for Lambda to shut down the container in the middle of change stream logic and leave the change stream logic in a broken state.

You need a way to process a change stream explicitly in your handler, picking up where the previous change stream left off, and explicitly close the change stream when your handler is done. I've been running the below handler for a few hours and as far as I can tell it picks up changes correctly and without any unpredictable errors. Just need to fix #11527 to make this handler work without having to bypass Mongoose's change stream wrapper.

const warmer = require('lambda-warmer');

const Responses = require('./helpers/responses'),
      { connectToDatabase, getChangeStream } = require('./helpers/db'),
      { Country } = require('./model/country'),
      { logd, loge, logi } = require('./helpers/debuglog');

let resumeAfter = undefined;

module.exports.queryCountries = async (event, context) => {
  // add this so that we can re-use any static/global variables between function calls if Lambda
  // happens to re-use existing containers for the invocation.
  context.callbackWaitsForEmptyEventLoop = false;

  try {
    if (await warmer(event)) {
      logd('queryCountries> do nothing just warm up...');
      return;
    }

    console.time('connectToDatabase');
    await connectToDatabase();
    console.timeEnd('connectToDatabase');

    console.time('Country.findById');
    const found = await Country.findById('AU');
    console.timeEnd('Country.findById');

    // Use MongoDB Node driver's `watch()` function, because Mongoose change streams
    // don't support `next()` yet. See https://github.com/Automattic/mongoose/issues/11527
    const changeStream = await Country.collection.watch([], { resumeAfter });

    // Change stream `next()` will wait forever if there are no changes. So make sure to
    // stop listening to the change stream after a fixed period of time.
    let timeoutPromise = new Promise(resolve => setTimeout(() => resolve(false), 1000));
    let doc = null;
    while (doc = await Promise.race([changeStream.next(), timeoutPromise])) {
      logd('Got:' + JSON.stringify(doc, null, '  '));
    }

    // `resumeToken` tells you where the change stream left off, so next function instance
    // can pick up any changes that happened in the meantime.
    resumeAfter = changeStream.resumeToken;
    await changeStream.close();

    // Test code to make sure this function picks up previous function call's changes
    // logd('Run updateMany?');
    // const updateRes = await Country.updateMany({}, { $inc: { sentSms: 1 } });
    // logd('Updated', updateRes);

    logd('queryCountries> return...', resumeAfter);
    return { country: found };
  }
  catch (e) {
    loge('queryCountries> Catch', e);
    return Responses.internalServerError(e);
  }
};

@vkarpov15 vkarpov15 added docs This issue is due to a mistake or omission in the mongoosejs.com documentation and removed confirmed-bug We've confirmed this is a bug in Mongoose and will fix it. labels Mar 27, 2022
@vkarpov15
Copy link
Collaborator

I meant to make 9eaa0a9 reference this issue. I added some notes in our change streams docs about this approach.

@newbaluku
Copy link
Author

@vkarpov15 Thanks for your effort and updates. Will see how to incorporate this into my code.

@newbaluku
Copy link
Author

@vkarpov15 just to make sure that I understand this correctly.

It is not recommended to use ChangeStream with event emitter in Lambda functions. I should iterate the changeStream myself using the above code instead, right?

@newbaluku
Copy link
Author

@vkarpov15 just to make sure that I understand this correctly.

It is not recommended to use ChangeStream with event emitter in Lambda functions. I should iterate the changeStream myself using the above code instead, right?

@vkarpov15 any comments?

@vkarpov15
Copy link
Collaborator

@newbaluku sorry for missing this. This comment is mostly correct. If you use a change stream in a Lambda function, we recommend closing the change stream at the end of your function invocation, and creating a new change stream instance at the next invocation using refresh tokens to pick up where the last change stream left off. It is easier to do this by iterating the change stream imperatively using next(), which is why we recommend that approach. However, you can use the event emitter pattern as long as you remember to close the change stream when your Lambda function is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This issue is due to a mistake or omission in the mongoosejs.com documentation
Projects
None yet
Development

No branches or pull requests

6 participants