Eliminate non-scaling memory consumption to enable iron for High Performance Computing #112
To fix the memory issue, the following steps should be taken:
A first attempt at the first three bullet points was started with issue #61, but the resulting code was deleted.
This repo implements the setup of the domain mappings for elements, nodes and dofs without GLOBAL_TO_LOCAL_MAP. Determining the boundary and ghost numbers involves a lot of local MPI communication, while global communication is avoided as far as possible.

Elements mapping
If the decomposition is of type CMFE_DECOMPOSITION_CALCULATED_TYPE, Parmetis is used to do the element decomposition. If it is CMFE_DECOMPOSITION_USER_DEFINED_TYPE, the user code can set the domain for each element with cmfe_Decomposition_ElementDomainSet. This routine can be called asynchronously across the processes; a process does not have to specify the domains of all elements, nor even of all of its own elements. In total, the domain of each element has to be specified by at least one process, which can be any process (see the sketch below).

Nodes mapping
The internal nodes are decomposed like their adjacent elements. The potential boundary nodes are distributed by a heuristic such that the numbers of nodes per process are as equal as possible.

Dofs mapping
The dofs numbering follows the global nodes numbering and requires a large Allgatherv, because the global dof numbers on a process depend on the local numbers of dofs on all previous processes (see the prefix-sum sketch below).

The pull request does not delete anything of the old implementation. The new code in
It works if one sets either one or both of the global variables. An example with test cases for the new code can be found here (based on the laplace example). Note that the implementation was started from the Stuttgart version of iron and thus contains a lot of other changes that are unrelated to the mappings.
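As an illustration of the user-defined decomposition path described above, here is a minimal sketch of how user code might assign element domains. It is meant to be called between cmfe_Decomposition_CreateStart and cmfe_Decomposition_CreateFinish; the round-robin rule, the subroutine name and the module/integer kinds are made up for the example and may need adjusting to the iron version in use — only the two cmfe calls are the ones named in the text.

```fortran
! Sketch only: each rank assigns domains for "its" share of the elements.
! The cmfe calls are the ones named above; everything else (names,
! round-robin rule, plain integer kinds) is illustrative.
subroutine AssignElementDomains(decomposition, numberOfGlobalElements, &
  & myRank, numberOfRanks, err)
  use OpenCMISS_Iron
  implicit none
  type(cmfe_DecompositionType), intent(inout) :: decomposition
  integer, intent(in)  :: numberOfGlobalElements, myRank, numberOfRanks
  integer, intent(out) :: err
  integer :: elementNumber

  ! Use the user-defined decomposition instead of the calculated (Parmetis) one.
  call cmfe_Decomposition_TypeSet(decomposition, &
    & CMFE_DECOMPOSITION_USER_DEFINED_TYPE, err)

  ! Each rank only sets the domain of every numberOfRanks-th element.
  ! No single rank sets all elements; collectively every element is set
  ! exactly once, which is all that is required.
  do elementNumber = 1, numberOfGlobalElements
    if (mod(elementNumber - 1, numberOfRanks) == myRank) then
      call cmfe_Decomposition_ElementDomainSet(decomposition, elementNumber, &
        & myRank, err)
    end if
  end do
end subroutine AssignElementDomains
```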
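The statement that the global dof numbers on a process depend on the dof counts of all previous processes amounts to a prefix sum over the ranks. The following self-contained sketch shows that pattern with plain MPI; the variable names are invented, and iron's actual routine uses Allgatherv on more data than just the counts.

```fortran
! Sketch only: why global dof numbers need the dof counts of all previous ranks.
program dof_offset_sketch
  use mpi
  implicit none
  integer :: ierr, myRank, numberOfRanks, i
  integer :: numberOfLocalDofs, firstGlobalDof
  integer, allocatable :: dofsPerRank(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myRank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, numberOfRanks, ierr)

  numberOfLocalDofs = 100 + myRank   ! stand-in for the real local dof count

  ! Gather every rank's dof count on every rank ...
  allocate(dofsPerRank(numberOfRanks))
  call MPI_Allgather(numberOfLocalDofs, 1, MPI_INTEGER, &
                     dofsPerRank, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)

  ! ... so that the first global dof number on this rank is the sum of the
  ! dof counts of all previous ranks (1-based global numbering).
  firstGlobalDof = 1
  do i = 1, myRank
    firstGlobalDof = firstGlobalDof + dofsPerRank(i)
  end do

  print '(A,I0,A,I0)', 'rank ', myRank, ': first global dof = ', firstGlobalDof

  call MPI_Finalize(ierr)
end program dof_offset_sketch
```

As a side note, MPI_Exscan over the local counts yields the same offsets without storing one value per rank on every process.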
Currently iron cannot be used for highly parallel simulations on more than ~10 compute nodes (24 ranks each).
The reason is the memory consumption per process, which rises linearly instead of staying constant in a weak-scaling scenario (i.e. the number of processes and the problem size are increased at the same rate).
The cause is the way various items (elements, nodes, dofs, matrix columns, etc.) are numbered, each using a GLOBAL_TO_LOCAL map. Such an array holds information about the whole problem on every process. It is mainly used for the setup of the domain_mappings, but it is also accessed afterwards.
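To make the scaling argument concrete, here is a deliberately simplified sketch of such a map; the fields are reduced to a bare minimum and do not match iron's actual derived types, the numbers are invented.

```fortran
! Sketch only: a per-item global-to-local entry, allocated for EVERY global
! item on EVERY process. Fields and numbers are purely illustrative.
program global_to_local_memory_sketch
  implicit none

  type GlobalToLocalEntry
    integer :: numberOfDomains = 0
    integer, allocatable :: localNumber(:)   ! local number on each owning domain
    integer, allocatable :: domainNumber(:)  ! owning/ghosting domains
  end type GlobalToLocalEntry

  type(GlobalToLocalEntry), allocatable :: globalToLocalMap(:)
  integer :: numberOfRanks, elementsPerRank, totalElements

  ! Weak scaling: the work per rank stays fixed while the rank count grows.
  numberOfRanks   = 240        ! e.g. 10 nodes with 24 ranks each
  elementsPerRank = 10000
  totalElements   = numberOfRanks * elementsPerRank

  ! The map is sized by the GLOBAL number of items, so the memory each rank
  ! spends on it grows linearly with the number of ranks, instead of staying
  ! proportional to elementsPerRank.
  allocate(globalToLocalMap(totalElements))

  print '(A,I0,A)', 'entries stored on every rank: ', totalElements, &
    ' (grows with the number of ranks under weak scaling)'
end program global_to_local_memory_sketch
```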
Another non-scaling data structure is the mesh topology, which is used to define the mesh from the user code. It contains a global list of all elements, nodes, etc. Once the decomposition is created, the mesh can be deallocated, so the hope is that it is not the limiting bottleneck for memory consumption. However, its memory still grows in proportion to the global problem size regardless of the decomposition, so it will sooner or later become an issue as well.