[NALU] mesh 256 wl fails on 6144 MPI ranks #6

Open
teabagk7 opened this issue Feb 15, 2021 · 10 comments

Comments

@teabagk7

what(): 107: <....>/Trilinos_2/packages/zoltan2/core/src/problems/Zoltan2_PartitioningSolution.hpp,1572 107: error: Value for num_global_parts is different on different processes

Runs on 192, 384, 768, 1536, and 3072 ranks work fine, with no such error.

Mesh 512 works on 768, 1536, 3072, and 6144 ranks!

@chchang6

@teabagk7 Thanks for reporting this, I've forwarded to the developer and will update ASAP.

@chchang6

@teabagk7 The 256 mesh, 6144-rank run tests out on our system with the reference commit (see the Nalu README for hashes). We will accept results from different commits, since we recognize how much work is required to generate the results you already have. We suggest building the older version of the code to generate the 256 mesh 6144-rank results.

@teabagk7
Author

I've built exactly the same hashes of Trilinos and Nalu.
This problem appears only on the mesh 256 test with 96 nodes.

@chchang6

I didn't mention a Trilinos hash. The two hashes we mention in the README are of Nalu code. Runs at 6144 ranks and the 256 mesh run to completion on our reference hardware.

What Nalu hash are you working with?

@teabagk7
Author

Nalu-Wind Version: v1.2.0
Nalu-Wind GIT Commit SHA: c7c3723261cf1eebe73ef969396d08d342a01644-DIRTY
Trilinos Version: 13.1-g53550bee94b
TPLs: Boost, HDF5, netCDF, STK, Trilinos, yaml-cpp and zlib
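
The `-DIRTY` suffix on the commit SHA presumably means the build came from a working tree with uncommitted changes. That reading is an assumption, since the thread does not say how the suffix is generated. A minimal sketch for checking a checkout before comparing hashes (the `is_dirty` helper name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when tracked files in the given
# checkout differ from HEAD, i.e. the working tree is "dirty".
is_dirty() {
    repo_dir=$1
    # --porcelain gives stable, script-friendly output. Untracked files are
    # ignored here so that build artifacts do not count as modifications.
    [ -n "$(git -C "$repo_dir" status --porcelain --untracked-files=no)" ]
}
```

For example, `is_dirty ./nalu-wind && echo "local modifications present"` would flag a tree whose tracked files no longer match the recorded commit.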

@chchang6

Try Nalu-Wind commit 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a for the 256 mesh, 6144-rank test.

@gcstoianowski

Try Nalu-Wind commit 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a [...]

[gerardo@login01 build-test]$ git checkout 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a
fatal: reference is not a tree: 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a

@chchang6

[cchang@el1 cchang]$ git clone https://github.com/Exawind/nalu-wind.git
Cloning into 'nalu-wind'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 25671 (delta 22), reused 36 (delta 13), pack-reused 25602
Receiving objects: 100% (25671/25671), 17.46 MiB | 14.71 MiB/s, done.
Resolving deltas: 100% (20518/20518), done.
[cchang@el1 cchang]$ cd nalu-wind/
[cchang@el1 nalu-wind]$ git checkout 1d3ee2
Note: checking out '1d3ee2'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

git checkout -b new_branch_name

HEAD is now at 1d3ee2e... Updating golds in response to #692.

@gcstoianowski

Thank you. I was using 'git clone https://github.com/exawind/build-test.git', which I got from Step 4 of https://nalu-wind.readthedocs.io/en/latest/source/user/build_spack.html
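
The `reference is not a tree` failure above is what git reports when the requested commit is not present in the repository being used, here a clone of build-test rather than nalu-wind. A minimal sketch of a pre-checkout check, assuming a local clone already exists (the `has_commit` helper name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the given clone contains the
# commit, and fails otherwise.
has_commit() {
    repo_dir=$1
    sha=$2
    # cat-file -e exits non-zero when the object is missing; the ^{commit}
    # suffix additionally requires that the object resolves to a commit.
    git -C "$repo_dir" cat-file -e "${sha}^{commit}" 2>/dev/null
}
```

Running `git -C <dir> remote get-url origin` first shows which repository a directory was cloned from; `has_commit nalu-wind 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a` should then succeed in a nalu-wind clone and fail in a build-test clone.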

@chchang6

OK, thanks @gcstoianowski. I'll forward this to the benchmark steward to see if we can clarify the instructions on our end a bit.
