S3D hang on Frontier #1683

Open
syamajala opened this issue Apr 10, 2024 · 1 comment

syamajala commented Apr 10, 2024

I am seeing S3D hang at 8 nodes (2 ranks/node) on Frontier after 10 timesteps. No threads appear to be making progress. I am running with all of @elliottslaughter's flags.

There are some stack traces here: http://sapling2.stanford.edu/~seshu/s3d_tdb/frontier/stacktraces/

syamajala added the S3D label Apr 10, 2024

elliottslaughter commented

I've been reviewing this with Seshu. The symptoms are identical to what I was seeing at 8192 nodes on Frontier, but here they occur at a dramatically smaller node count. I don't think I've ever seen a network freeze below 128 nodes, let alone at 8.

The network variables check out and should be correct for the configuration Seshu is running.

The stack traces all appear to be effectively empty, which is consistent with what I was seeing.

The CXI debug logging doesn't print anything meaningful, which is also consistent with what I was seeing.

We checked the NIC binding and it's fine.

I don't know what else to say. These runs seem to be doing all the right things, but they're freezing anyway.
