Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meros's ULimit code doesn't support Windows and is untested. #158

Open
kayabaNerve opened this issue Mar 22, 2020 · 14 comments · Fixed by #169
Open

Meros's ULimit code doesn't support Windows and is untested. #158

kayabaNerve opened this issue Mar 22, 2020 · 14 comments · Fixed by #169
Labels
bug Something isn't working untested This issue, or its fix, needs testing to verify its accuracy

Comments

@kayabaNerve
Copy link
Member

kayabaNerve commented Mar 22, 2020

There's two problems here:

  1. There's a connection limit in the first place.
  2. The RPC becomes disabled.

EDIT: The original title was "Meros will accept 500 client connections and then refuse to handle any. ". Please read the full issue to see how this has evolved.

EDIT: The second title "Meros's network activity eventually hangs." has also been replaced.

@kayabaNerve kayabaNerve added bug Something isn't working critical This needs to be fixed ASAP labels Mar 22, 2020
@kayabaNerve
Copy link
Member Author

kayabaNerve commented Mar 24, 2020

This from an OSError that there's too many open files. As this is likely per OS, there's no good answer to how to solve this.

The three solutions are:
A) Find a magic way where accepting a socket doesn't cause "Too many open files", which isn't possible. This isn't a risk when opening sockets as we only peer find to find 8 sockets.
B) Set a socket cap.
C) Require users to override ulimit, still setting a cap at ~65536 connections.

The second isn't preferred, especially due to how we handle seed nodes. If the seed nodes become filled, no new peers could join the network. We'd need to add a HandshakeRejection network message combined which also sends a list of alternate peers.

@kayabaNerve
Copy link
Member Author

kayabaNerve commented Mar 24, 2020

The Ubuntu ulimit is 1024. It kills Meros off at 490-500 sockets (980-1000 files). At least on Macs, the default ulimit is 256. Leaving 44 files open for whatever else Meros is doing (and increasing to 64 to be sure), that leaves 192 files or 96 connections.

We should add a config value for this, as some servers can raise their ulimit further, but the config default should be 96. In order to handle nodes which are full, I'd like to propose Busy. A copy of Peers with a new message byte which is used in place of Handshake or Syncing`.

@kayabaNerve kayabaNerve changed the title Meros will accept 500 client connections and then refuse to handle any. Meros will accept ~500 client connections and then refuse to handle any due to ulimit. Mar 24, 2020
@kayabaNerve
Copy link
Member Author

Also, to be clear, we can deal with ulimit if we just skip exceptions when accepting sockets, which we already do, and use an unlimited file related ulimit on the seed nodes. The reason this is an issue is:

  1. Because it's RPC blocking.
  2. We don't only use files for sockets. We have a DB and co. If we can open those files at boot, then it'd be fine except for Review of SECP256K1Wrapper #1.
  3. Nodes can become overwhelmed so a Busy message is a good addition.

@kayabaNerve
Copy link
Member Author

To be clear, I actually have no idea what 44 files Meros is opening on its own, even when lowering the amount to 24 (low bound). The only files I know of are the 4 database files (DB/WalletDB) and two server sockets (peer server/RPC server). We need to run strace to see what else is creating files and make sure 64 is a proper allocation.

@kayabaNerve
Copy link
Member Author

Even with only 3 other peer connections, Meros will eventually refuse new connections. Something isn't closing its files.

@kayabaNerve
Copy link
Member Author

ulimit, combined with lingering file accesses, are causing nodes to technically remain running yet shut off their network activity. This isn't a crash but it is an unstable setup.

@kayabaNerve kayabaNerve changed the title Meros will accept ~500 client connections and then refuse to handle any due to ulimit. Meros's network activity eventually hangs. Mar 29, 2020
@kayabaNerve
Copy link
Member Author

For this issue to be closed:

  • Lingering files must be closed.
  • Meros needs a configurable connection limit.

@kayabaNerve
Copy link
Member Author

I believe this is because error'ed sockets leave their FDs open, yet I'm not sure.

@kayabaNerve
Copy link
Member Author

This may or may not be fixed in the chronos branch. It adds several close statements to errored sockets and stops the socket overwriting that was happening as part of #159. That said, due to the same issue, I can't currently test it. I just wanted to note the progress.

@kayabaNerve
Copy link
Member Author

kayabaNerve commented Apr 5, 2020

https://github.com/MerosCrypto/Meros/tree/ulimit has been created in response to this issue. It handles the max file limit, amount of open files, and whether or not new sockets can be created. It also adds a Busy message, currently undocumented, identical to Peers. It can be sent in response to any initial Handshake/Syncing.

Before it can be merged, it needs documentation of the Busy message and thorough testing. As of right now, it also doesn't support Windows, but I plan on merging it Windows support (or even confirmed BSD/macOS support), creating a new issue for Windows support/verified BSD and macOS support.

If this is properly implemented, combined with verifying there's no lingering sockets, we should have a stable net that, even when the seed nodes are at capacity, will still let new nodes join. Then we can merge this close this issue.

@kayabaNerve
Copy link
Member Author

The ulimit code works on Linux, yet is untested on macOS/BSD. Meros will go up to however many peers ulimit allows, and then start sending Busy. At this time:

That said, despite having theoretical macOS/BSD support, we don't have Windows support. It should be implementable via WinAPI's GetProcessHandleCount.

This will be merged, and moved to bug/untested, when a test for the second case is added. Bug will remain since BSD is only theoretically supported and Windows is unsupported. I'll try to spin up a FreeBSD VM some time.

@kayabaNerve
Copy link
Member Author

Alrighty. The test for the second case was added and passes. I'm running the tests locally, but then I'll merge this, re-title this issue, change the labels, and remove it from the milestone.

@kayabaNerve kayabaNerve mentioned this issue Apr 6, 2020
@kayabaNerve kayabaNerve reopened this Apr 6, 2020
@kayabaNerve kayabaNerve added untested This issue, or its fix, needs testing to verify its accuracy and removed critical This needs to be fixed ASAP labels Apr 6, 2020
@kayabaNerve
Copy link
Member Author

We need to implement Windows support for the FileLimitTracker.

We need to test:

  • Said Windows support.
  • macOS/BSD support (I'm fine with solely testing FreeBSD and counting it for both).
  • Whether or not Busy messages generated by Meros include peers.
  • What happens when enough RPC sockets are used the buffer files run out of room.

It should be noted for that last bullet point, we either need to cap the amount of RPC sockets OR start disconnecting peers to regain the buffer.

As for testing if we still leak sockets, which should've been fixed with Chronos, I feel the best way to test that will be to start up a net and see what happens.

@kayabaNerve kayabaNerve changed the title Meros's network activity eventually hangs. Meros's ULimit code doesn't support Windows and is untested. Apr 6, 2020
@kayabaNerve kayabaNerve removed this from the Developer Testnet 4 Stable milestone May 15, 2020
@kayabaNerve
Copy link
Member Author

This does compile on macOS, yet I have no idea if it works as expected. The user who compiled it didn't run the tests as well, and I didn't press them as they aren't especially tech savvy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working untested This issue, or its fix, needs testing to verify its accuracy
Development

Successfully merging a pull request may close this issue.

1 participant