The gatekeeper queries Mongo to verify that an API key is valid. This query sometimes fails, leading to users being denied even when they supply a valid key. The problem crops up only rarely, so it wasn't widely seen, which explains why it's only been discovered now. I have an ugly workaround in place that seems to address the problem, but this deserves more investigation, since the workaround is neither ideal nor performant.
What's happening: each gatekeeper proxy process holds open a persistent connection to Mongo that gets reused across all the requests served by that process. The issue crops up when that persistent connection is unexpectedly terminated. There's then a brief window during which the mongo client isn't aware that its connection has been terminated, so its subsequent queries fail until it reconnects.
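The failure window can be illustrated with a small stub (names here are hypothetical, not the actual driver API): the client keeps reusing a socket it believes is alive, and only notices the break when a query actually errors out.

```javascript
// Hypothetical stub illustrating the stale-connection window.
// This is NOT the real mongo driver API, just a model of the behavior.
function StubConnection() {
  this.socketAlive = true;       // state of the underlying TCP socket
  this.clientThinksAlive = true; // what the client believes
}

StubConnection.prototype.query = function (callback) {
  if (!this.socketAlive) {
    // The client only discovers the break when a query fails.
    this.clientThinksAlive = false;
    return callback(new Error('connection closed'));
  }
  callback(null, { apiKeyValid: true });
};

StubConnection.prototype.reconnect = function () {
  this.socketAlive = true;
  this.clientThinksAlive = true;
};

var conn = new StubConnection();
conn.socketAlive = false;            // network/firewall silently kills the socket
console.log(conn.clientThinksAlive); // still true -- this is the bad window
conn.query(function (err) {
  console.log(err.message);          // 'connection closed' -- key check fails
});
conn.reconnect();
conn.query(function (err, res) {
  console.log(res.apiKeyValid);      // true again once reconnected
});
```

During that window, a valid key lookup errors out and the user gets denied, which matches the observed symptom.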
This may be related to the hosting environment and network or firewall settings that lead to the disconnects: https://support.mongolab.com/entries/23009358-handling-dropped-connections-on-windows-azure. However, my attempts to fix it with keepalive settings have been unsuccessful. The networked nature of the problem also probably explains why this has never cropped up in unit tests or other local environments where mongo runs on the same machine as the gatekeeper.
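For reference, the keepalive attempt looked roughly like this (a sketch assuming the node-mongodb-native 1.x `server.socketOptions` options format; exact option names may differ by driver version):

```javascript
// Sketch of connection options enabling TCP keepalive, assuming the
// node-mongodb-native 1.x options format. Option names are version-dependent.
var connectOptions = {
  server: {
    socketOptions: {
      keepAlive: 1,           // enable TCP keepalive on the socket
      connectTimeoutMS: 30000 // fail connection attempts after 30s
    }
  }
};

// e.g. passed through when opening the connection:
// mongoose.connect(mongoUrl, connectOptions, callback);
```

Even with keepalive enabled at this layer, the Azure-style idle-connection drops described in the linked article seemed to persist.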
The workaround I have in place (see NREL/api-umbrella-gatekeeper@ff9da2a and the couple of subsequent commits) basically just keeps retrying the mongo query every 50 ms, up to 100 times. Some retry mechanism may be needed, but the number of retries that are currently necessary makes no sense to me. In one environment where this is a problem, I've seen it take 60 or 70 retries before finally succeeding. With the wait time between retries, this adds a significant amount of latency to the request if a user happens to be the super-unlucky one who hits it right when the connection drops.
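The retry loop boils down to something like the following sketch (hypothetical helper and parameter names; the real implementation is in the commits referenced above):

```javascript
// Sketch of the retry-on-failure workaround: re-attempt a query every
// `intervalMs` milliseconds, up to `maxAttempts` times. Names are
// hypothetical; the actual code lives in the gatekeeper commits above.
function retryQuery(queryFn, intervalMs, maxAttempts, callback) {
  var attempts = 0;

  function attempt() {
    attempts++;
    queryFn(function (err, result) {
      if (!err) {
        return callback(null, result, attempts);
      }
      if (attempts >= maxAttempts) {
        return callback(err); // give up: the user sees a denial
      }
      setTimeout(attempt, intervalMs); // wait, then try again
    });
  }

  attempt();
}
```

With the numbers above (50 ms interval, up to 100 attempts), a request that hits the bad window can stall for multiple seconds before the key check succeeds, which is why this is only a stopgap.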