Generate assignment loops for large preintegration tables #25
Conversation
FFCX is much stricter than FFC with the flake8 tests.
I'm a little bit confused about some flake8 errors: "undefined name 'block_rank'" on lines 985 and 988 of ffcx/ffc/uflacs/integralgenerator.py (at 5f6d990). On line 981 of ffcx/ffc/uflacs/integralgenerator.py (at 5f6d990), the definition of the variable is indeed commented out. In my original commit this was not the case. Did this happen accidentally?
Possibly - feel free to uncomment. FFC was in very poor shape regarding static code checking, so some aggressive linting was needed - more aggressive than one would normally wish.
No problem.
Fixes for flake8 errors
@blechta Could you look at fixing the merge conflicts, or close the PR?
We discussed that it's probably more sensible to catch the issue that motivated this PR at an earlier stage, which might result in far fewer code changes. However, I did not look into this yet.
Resolved conflicts: ffc/uflacs/integralgenerator.py
Opened #38 for discussion
I cleaned the code a bit. Now there is a parameter
The cleanup is nice. I'll look at it again tomorrow, but there are currently two problems with the fix in general:
Together with these fixes it makes sense to rethink the current loop assignment approach, as I mentioned in the issue (e.g. to use a DOF map for sparse-matrix-like assignment like
Did you confirm that it does not work by running the code? This is what tsfc generates (for the UFC interface; Firedrake kernels do not zero the tensor), and it runs fine in minidolfin, which is C, not C++.
I pushed the fix for 2., but it is not tested. I think it's pretty straightforward now. The memset is not needed anymore, although I believe it works.
Sounds good, I'll check tomorrow.
Ah,
This fixes the mixed unrolled/looped case but the code is difficult to follow. Refactor!
Nice. I guess the results look good, the generated code probably less so ;)
Even less code duplication for preintegrated blocks
ffc/uflacs/integralgenerator.py
Outdated
# Index the static preintegrated table:
P_ii = P_entity_indices + P_arg_indices
A_rhs = f * PI[P_ii]
if blockdata.unroll:
So we could just have if blockdata.unroll: continue at the beginning of the loop?
Yes, good catch.
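A minimal sketch of the control flow suggested above, assuming a hypothetical block-generation loop; the names generate_preintegrated_assignments, blocks, and emit_block are illustrative and not the actual integralgenerator.py API:

def generate_preintegrated_assignments(blocks, emit_block):
    """Illustrative sketch only: skip unrolled preintegrated blocks at the
    top of the loop instead of branching on blockdata.unroll further down."""
    code = []
    for blockmap, blockdata in blocks:
        if blockdata.unroll:
            # Unrolled blocks are emitted by a separate code path.
            continue
        code += emit_block(blockmap, blockdata)
    return code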
Skip non-unrolled preintegrated blocks in unrolled code generation
Remove unnecessary line break
We could test with minidolfin for the dependence of GCC behaviour on
I thought about opening a PR, at least for the FFC support, if it's ok that it targets FFC-X?
Definitely FFC-X. (You could also open a WIP PR with the cross-cell parallelization. It would be good to see how disruptive the change is.)
Actually, I did not find an element with pre-integration table sizes between 35^2 (e.g. CG_7(2D), which works fine with GCC) and 44^2 (e.g. NED^1_3(3D), CG_8(2D), which causes the issue). Do you have any ideas @blechta? Otherwise one could just use
Revert commits
Should we rather be on the safe side, i.e., try to prevent the catastrophe for cases where we don't know how they behave, i.e., set the lower value? What about the quite arbitrary 1536? Sufficient granularity to test could be given by a mass matrix on
Ah, good idea, I'll try using the vector element.
@blechta I think the vector element dimension only affects the element matrix size, not the PI table size (probably the same PI table is reused for every component?)
@w1th0utnam3 Yeah, makes sense, the tables are reused. One could go with
Ok, I have to experiment a little bit more. Your last suggestion works and is useful for generating PI tables of arbitrary size. However, it turns out that this isn't really the problem. E.g. for the mass matrix with Lagrange basis functions, a single huge PI table is generated; in the generated code, every line only contains a single assignment and GCC can compile it without a problem. In contrast, for the curl-curl problem with Nedelec elements, there are 45 PI tables of size 44x44 each, which results in huge assignment statements.
Good digging. We have to check a bit more.
So do I understand correctly that long assignments are the problem? Do we know how to decide that at the IR stage?
All the information should be available, but I guess that it's not super trivial to check. The problem occurs when you have many overlapping blocks for a single entry of the cell matrix. However, I don't know if the issue already occurs when there is a single huge assignment or whether there have to be many of them. But in general the solution would probably go in the direction of counting the number of block overlaps per entry in the cell matrix, and then maybe something like
and then we have the two parameters
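A rough sketch of the kind of IR-stage check discussed here, under assumed inputs: blockmaps (the element-tensor entries each preintegrated block writes to), table_shapes, and the two thresholds max_overlaps and max_table_entries are illustrative names, not existing FFC parameters.

from collections import Counter

def choose_unroll(blockmaps, table_shapes, max_overlaps=4, max_table_entries=1024):
    """Illustrative heuristic only: unroll a preintegrated block unless its
    table is large or some element-tensor entry receives contributions from
    too many overlapping blocks."""
    overlaps = Counter()
    for entries in blockmaps:
        overlaps.update(entries)

    decisions = []
    for entries, shape in zip(blockmaps, table_shapes):
        n_entries = 1
        for dim in shape:
            n_entries *= dim
        too_big = n_entries > max_table_entries
        too_crowded = any(overlaps[e] > max_overlaps for e in entries)
        decisions.append(not (too_big or too_crowded))
    return decisions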
Closing for now - can revisit later once major restructuring is done.
Fixes https://bitbucket.org/fenics-project/ffc/issues/173/uflacs-preintegration-emits-inefficient
Fix for issue 173: GCC (<= 7.3.0) eats all available memory during
optimization of the C++ code generated for an example with a 45x45 element
tensor (mentioned in the issue). See the issue for more details.
Summary of changes: Modified UFLACS code generation to output loops
instead of linear assignments for forms with large pre-integration
tables (> 1024 entries).
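A minimal sketch of the size check described in this summary; the helper name, the w factor, and the way the table is passed are assumptions for illustration, not the actual UFLACS code:

def emit_block_code(A_name, PI, unroll_threshold=1024):
    """Illustrative sketch only: emit one assignment per entry for small
    preintegration tables, and a compact loop over a static table otherwise.
    PI is assumed to be a 2D numpy array of preintegrated values."""
    nrows, ncols = PI.shape
    if PI.size <= unroll_threshold:
        # Small table: unrolled, one assignment statement per entry.
        return [f"{A_name}[{i}*{ncols} + {j}] += w * {PI[i, j]};"
                for i in range(nrows) for j in range(ncols)]
    # Large table: reference a statically defined table and loop over it,
    # keeping the generated C code small and easier on the optimizer.
    return [f"for (int i = 0; i < {nrows}; ++i)",
            f"  for (int j = 0; j < {ncols}; ++j)",
            f"    {A_name}[i*{ncols} + j] += w * PI[i][j];"]

The threshold of 1024 entries mirrors the figure quoted in the summary; the actual parameter name and default in the PR may differ.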