MPI_ERR_TAG: invalid tag error #206
Are you able to identify which MPI_Irecv call is giving you this error? I haven't seen tag values go beyond the allowed MPI upper bound, but I do see some cases where the tag value is unexpectedly large during the execution of BergerRigoutsos, so I'm taking a look at that.
We have not yet identified the specific call.
I found a past report from a user who was running out of valid tags for BergerRigoutsos, which turned out to be a result of their MPI installation having an unusually small value for MPI_TAG_UB. We didn't fix anything in SAMRAI in that case, as their solution was to use a better MPI installation. Other symptoms in their case don't match what you report, so I doubt this is what is causing your error, but you can check your MPI_TAG_UB value with a call to MPI_Attr_get(). In their case the error happened immediately instead of emerging over time, and TileClustering worked even with the small MPI_TAG_UB value. I will keep checking to see whether any of our calculated MPI tag values would tend to grow over time.
ok, I ran this on my machine with openmpi 4.1.1:

```cpp
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int NoOfProcess;
    int ProcessNo = 0;
    int* tagUB;
    int flag;
    MPI_Comm_rank(MPI_COMM_WORLD, &ProcessNo);
    MPI_Comm_size(MPI_COMM_WORLD, &NoOfProcess);
    // The attribute value returned is a pointer to the int holding MPI_TAG_UB;
    // the last argument is a flag indicating whether the attribute was set.
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tagUB, &flag);
    std::cout << ProcessNo << " among " << NoOfProcess
              << " and tagUb = " << *tagUB << " flag = " << flag << "\n";
    MPI_Finalize();
}
```

and got:
I did it again on the local cluster where I'm running the tests mentioned in this issue, with Open MPI 4.1.0, and got:
With a different version they propose (Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923) I get:
So it turns out that the upper bound tag is indeed significantly lower on the cluster we perform tests on... I'll check with the admins why that is the case. My machine is a Fedora workstation with a packaged OpenMPI install, nothing fancy done there, so I'm a bit puzzled why the cluster versions would have much lower tag limits.

In any case, it's still a bit weird that tags only tend to become invalid after a serious number of steps. One thing I should maybe mention here: I observe in my current test runs that SAMRAI seems to tag and regrid the hierarchy every coarsest time step, or even more often for some levels. That seems excessive and is probably not the canonical way it is supposed to be used (typically I would think one wants to tag at a pace that depends on how fast the solution evolves, and I understood that the times at which tagging occurs should be specified in the inputs of the StandardTagAndInitialize instance?). This probably results in lots of calls to the load balancer (because the clustering does not seem to be the source of the problem, I blame the balancer!) and ends up increasing some tags more than they would in "normal" usage...
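For what it's worth, a minimal sketch of how tagging frequency is typically steered from the input file. The block names (`StandardTagAndInitialize`, `TimeRefinementIntegrator`) and the `tagging_method` / `regrid_interval` parameters follow SAMRAI's documented input conventions as I understand them, but exact syntax varies between SAMRAI versions, so treat this as an assumption to check against your version's documentation:

```
// Assumed input-file sketch; verify parameter names against your SAMRAI version.
StandardTagAndInitialize {
   tagging_method = "GRADIENT_DETECTOR"
}

TimeRefinementIntegrator {
   regrid_interval = 4   // e.g. retag/regrid every 4 steps (illustrative value)
}
```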
I found a place where SAMRAI is computing strictly increasing tag values, where I think reusing constant values will be entirely safe. It is possible that these values are eventually reaching the upper bound on your systems with smaller upper bounds. #209 has a preliminary fix, if you would like to try it.
No more crash with the proposed fix.
Hi,
All of our runs are currently failing with this error:
The code works for thousands of time steps on a tagged simulation with 3 levels.
The runs all crash with this error message at about the same time.
These test models currently run on 20 cores.
Not sure to what extent this is related, but we use the TreeLoadBalancer and BergerRigoutsos at the moment.
I have seen in SAMRAI's code base that some tags are computed, e.g. in SAMRAI/source/SAMRAI/mesh/BalanceUtilities.cpp, lines 2301 to 2302 at commit aaa3853.
A priori, to be valid, tags should be between 0 and MPI_TAG_UB (whose value I don't know, but it is presumably larger than any tag you would want to compute).
Have you already experienced such an issue, and/or do you have an idea of what could go wrong or what would be worth investigating?
Thanks