Reduce number of file descriptors used on interchange from manager connections  #3022

@benclifford

Is your feature request related to a problem? Please describe.
Right now, the interchange needs two file descriptors per manager: one for the connection that carries tasks and heartbeats, and another for the connection that carries results.

There are some reasons why it would be interesting to consider using only one:

  • About once a year I encounter someone who runs out of file descriptors on their interchange. For example, one user has a limit of 4096 fds on their submit node. At two fds per manager, this restricts them to around 2000 managers; using one fd per manager would double that to around 4000.

  • As documented in #3019 (Non-deterministic hang in CI local tests due to combination of several existing issues), a half-connection can sometimes exist where only one of the two ports connects successfully. This is awkward to debug.
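
The fd arithmetic in the first point can be sketched as below. This is only an illustration, not code from the interchange: `max_managers` and its `reserved_fds` parameter are hypothetical names, and `reserved_fds` stands in for whatever descriptors the process needs for things other than manager connections (log files, listening sockets and so on), which I have not measured.

```python
import resource


def max_managers(fd_limit: int, fds_per_manager: int, reserved_fds: int = 0) -> int:
    """Rough upper bound on connected managers under a file descriptor limit.

    reserved_fds is a hypothetical allowance for descriptors not used by
    manager connections; the real overhead is an unknown here.
    """
    if fds_per_manager <= 0:
        raise ValueError("fds_per_manager must be positive")
    usable = max(fd_limit - reserved_fds, 0)
    return usable // fds_per_manager


if __name__ == "__main__":
    # Soft limit on open files for the current process (Unix only).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        print("no soft limit on open files")
    else:
        print("two fds per manager:", max_managers(soft, 2))
        print("one fd per manager: ", max_managers(soft, 1))
```

With a limit of 4096 and no reserved descriptors this gives 2048 managers at two fds each and 4096 at one, which matches the "around 2000" and "around 4000" figures once some overhead is subtracted.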

I think the history of these separate connections comes from an informal queuing-theory analysis of tasks flowing through the system versus results flowing in the other direction, under load. Given how the code works now, I'm not convinced that these two flows need two separate TCP connections, because under load receiving a result and sending a new task are tightly coupled. Specifically, I think it should be on the htex developers to properly justify (with actual theory) the need for two sockets.
