
Generate assignment loops for large preintegration tables #25

Closed
wants to merge 31 commits

Conversation

@blechta (Member) commented May 3, 2018

Fixes https://bitbucket.org/fenics-project/ffc/issues/173/uflacs-preintegration-emits-inefficient

Fix for issue 173: GCC (<= 7.3.0) eats all available memory while optimizing the C++ code generated for an example with a 45x45 element tensor (mentioned in the issue). See the issue for more details.

Summary of changes: modified the UFLACS code generation to output loops instead of linear assignments for forms with large preintegration tables (> 1024 entries).
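
For context, the difference between the two code shapes can be sketched in a few lines of Python. This is a minimal, self-contained illustration only; the helper name generate_assignment_code, the table name PI0 and the factor f are made up and are not FFC's actual API:

    # Illustrative sketch: emit unrolled assignments for small tables,
    # a compact double loop for large ones (names are made up).
    MAX_UNROLLED_TABLE_SIZE = 1024

    def generate_assignment_code(table_name, rows, cols):
        if rows * cols <= MAX_UNROLLED_TABLE_SIZE:
            # Small table: one linear assignment per entry (previous behaviour)
            return "\n".join(
                f"A[{i * cols + j}] += f * {table_name}[{i}][{j}];"
                for i in range(rows) for j in range(cols))
        # Large table: nested loops indexing the static table instead
        return (f"for (int i = 0; i < {rows}; ++i)\n"
                f"  for (int j = 0; j < {cols}; ++j)\n"
                f"    A[i * {cols} + j] += f * {table_name}[i][j];")

    print(generate_assignment_code("PI0", 45, 45))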

@garth-wells
Copy link
Member

FFCX is much stricter than FFC with the flake8 tests.

@w1th0utnam3 (Contributor)

I'm a little confused about some flake8 errors: "undefined name 'block_rank'" on lines 985 and 988,

for i in range(block_rank))

In line 981:
# block_rank = len(blockmap)

the definition of the variable is indeed commented out. In my original commit this was not the case; did this happen accidentally?

@garth-wells (Member) commented May 13, 2018

Possibly - feel free to uncomment it.

FFC was in very poor shape regarding static code checking, so some aggressive linting was needed - more aggressive than one would normally wish.

@w1th0utnam3 (Contributor)

No problem.
See #28

@garth-wells (Member)

@blechta Could you look at fixing the merge conflicts, or close the PR?

@w1th0utnam3 (Contributor)

We discussed that it's probably more sensible to catch the issue that motivated this PR at an earlier stage, which might require far fewer code changes. However, I have not looked into this yet.

@w1th0utnam3 (Contributor)

Opened #38 for discussion

@blechta (Member, Author) commented Jul 6, 2018

I cleaned up the code a bit. There is now a parameter max_preintegrated_unrolled_table_size, and the code should allow mixing unrolling and loops, because unrolling and inlining is decided per block. Nevertheless, I still have to find an example where that happens.

@w1th0utnam3 (Contributor) commented Jul 6, 2018

The cleanup is nice.

I'll look at it again tomorrow, but there are currently two problems with the fix in general:

  1. The looped code currently uses L.MemZero, which maps to memset and was left over from C++. Currently this generates a compiler error because in C it can only be used for strings (https://en.cppreference.com/w/c/string/byte/memset); L.MemZero should be removed or replaced with something C compatible.
  2. Mixing unrolled/looped code currently does not make sense because, with either ordering, the unrolled and looped code might overwrite each other's values with zeros. With my original fix this did not occur because I always used only one of the modes for all blocks. The looped code tries to use L.MemZero, and the unrolled code might write zeros depending on the branches in generate_tensor_value_initialization.

Together with these fixes it makes sense to rethink the current loop assignment approach, as I mentioned in the issue (e.g. to use a DOF map for sparse-matrix-like assignment, like the premultiplied mode does).

@blechta (Member, Author) commented Jul 6, 2018

  1. memset should work: it writes zero bytes into memory. People say it might not work because a floating-point zero is not necessarily represented by all-zero memory, but this is theoretical; I am not aware of any implementation like that.

Did you confirm that it does not work by running the code? This is what tsfc generates (for the UFC interface; Firedrake kernels do not zero the tensor) and that runs fine in minidolfin, which is C, not C++.

  2. Mixing should be possible, although right now it looks buggy. What about running the unrolled code first on the whole tensor? That will zero the whole tensor and add the unrolled parts. Then you add the loops, which will just add to the tensor.
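
A tiny sketch of that ordering in plain Python (not FFC-generated code; the entry values are made up, only the assign-then-accumulate ordering matters):

    # Phase 1 (unrolled code) initializes the whole tensor, including zeros;
    # phase 2 (looped code) only accumulates into it.
    n = 6
    A = [0.0] * n                      # element tensor, zeroed up front
    unrolled = {0: 1.5, 2: -0.5}       # entries written by the unrolled blocks
    looped = {2: 2.0, 4: 1.0}          # entries added by the looped blocks

    for i, value in unrolled.items():  # unrolled code first: plain assignment
        A[i] = value
    for i, value in looped.items():    # looped code second: accumulate only
        A[i] += value

    print(A)  # [1.5, 0.0, 1.5, 0.0, 1.0, 0.0]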

@blechta (Member, Author) commented Jul 6, 2018

I pushed the fix for 2., but it is not tested. I think it's pretty straightforward now. memset is not needed anymore, although I believe it works.

@w1th0utnam3 (Contributor)

Sounds good, I'll check tomorrow.
Ok, you're right, I misread something about memset. But I did get a compile error because of the call. I can have a look tomorrow, but as you say, it's not needed anymore with this approach.

@w1th0utnam3 (Contributor) commented Jul 6, 2018

Ah, memset is actually missing the zero as an argument; only two arguments are passed in. (But I would still just replace it with a for loop.)

Commit: This fixes the mixed unrolled/looped case but the code is difficult to follow. Refactor!
@blechta (Member, Author) commented Jul 6, 2018

python3 -mffc -fmax_preintegrated_unrolled_table_size=12 NodalMini.ufl gives the mixed case. Looks good 😄

@w1th0utnam3 (Contributor)

Nice. I guess the results look good, the generated code probably less so ;)

@w1th0utnam3 (Contributor)

@blechta removed some more code: #40

Even less code duplication for preintegrated blocks
# Index the static preintegrated table:
P_ii = P_entity_indices + P_arg_indices
A_rhs = f * PI[P_ii]
if blockdata.unroll:
@blechta (Member, Author)

So we could just have if blockdata.unroll: continue at the beginning of the loop?

@w1th0utnam3 (Contributor)

Yes, good catch.

@blechta (Member, Author) commented Jul 8, 2018

We could test with minidolfin how the GCC behaviour depends on max_preintegrated_unrolled_table_size and/or the GCC optimization flags. Do you plan on opening a minidolfin PR?

@w1th0utnam3 (Contributor)

I thought about opening a PR at least for the FFC support, if it's ok that it targets FFC-X?

@blechta (Member, Author) commented Jul 8, 2018

Definitely FFC-X. (You could also open a WIP PR with the cross-cell parallelization. It would be good to see how disruptive the change is.)

@w1th0utnam3 (Contributor) commented Jul 20, 2018

Actually, I did not find an element with preintegration table sizes between 35^2 (e.g. CG_7 in 2D, which works fine with GCC) and 44^2 (e.g. NED^1_3 in 3D or CG_8 in 2D, which cause the issue). Do you have any ideas @blechta? Otherwise one could just use max_preintegrated_unrolled_table_size = 44**2 - 1 for lack of other examples.

@w1th0utnam3 mentioned this pull request Jul 20, 2018
@blechta (Member, Author) commented Jul 26, 2018

Should we rather be on the safe side, i.e., try to prevent the catastrophe for the cases where we don't know how they behave, and set the lower value? What about a quite arbitrary 1536?

A mass matrix on VectorElement("P", cell, 2, dim=x) could give sufficient granularity to test...
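
In (legacy) UFL syntax that suggestion would look roughly like this (a sketch only; the cell and the dim value are arbitrary):

    # Sketch of the suggested test form; dim is arbitrary and would be
    # varied to scale the element tensor size.
    from ufl import VectorElement, TrialFunction, TestFunction, inner, dx, triangle

    element = VectorElement("P", triangle, 2, dim=7)
    u = TrialFunction(element)
    v = TestFunction(element)
    a = inner(u, v) * dx  # mass matrix form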

@w1th0utnam3 (Contributor) commented Jul 26, 2018

Ah, good idea, I'll try using the vector element.

@w1th0utnam3 (Contributor) commented Jul 28, 2018

@blechta I think the vector element dimension only affects the element matrix size, not the PI table size (probably the same PI table is reused for every component?).
As an alternative to your arbitrary choice: use 35**2 + 1 to be on the safe side?

@blechta (Member, Author) commented Jul 29, 2018

@w1th0utnam3 Yeah, makes sense, the tables are reused. One could go with P1 + P2 + P3 + ... to get some granularity. But 35**2 + 1 is fine with me.

@w1th0utnam3 (Contributor) commented Jul 29, 2018

Ok, I have to experiment a little bit more. Your last suggestion works and is useful for generating PI tables of arbitrary size. However, it turns out that this isn't really the problem. E.g. for a mass matrix with Lagrange basis functions, a single huge PI table is generated; in the generated code every line contains only a single assignment, and GCC can compile it without a problem. In contrast, for the curl-curl problem with Nedelec elements, the problem is that there are 45 PI tables of size 44x44 each, which results in huge assignment statements.
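
To make "huge assignment statements" concrete, a purely illustrative snippet (the weight and table names w0, PI0, ... are made up) that prints the shape of one such generated line:

    # Purely illustrative: one generated assignment where many preintegrated
    # tables contribute to the same element-tensor entry.
    n_tables = 45
    rhs = " + ".join(f"w{k} * PI{k}[0][0]" for k in range(n_tables))
    print(f"A[0] = {rhs};")  # a single source line with 45 summands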

@blechta (Member, Author) commented Jul 29, 2018

Good digging. We have to check a bit more.

@blechta (Member, Author) commented Aug 4, 2018

So do I understand it correctly that long assignments are a problem? Do we know how to decide that at the IR stage?

@w1th0utnam3 (Contributor) commented Aug 4, 2018

All the information should be available, but I guess it's not super trivial to check. The problem occurs when there are many overlapping blocks for a single entry of the cell matrix. However, I don't know whether the issue already occurs with a single huge assignment or whether there have to be many of them. In general, the solution would probably go in the direction of counting the number of block overlaps per entry of the cell matrix, and then maybe something like

overlaps = {}  # a_ij -> number of overlapping blocks at a_ij
huge_overlaps = [a_ij for a_ij in overlaps if overlaps[a_ij] > max_summands]
if len(huge_overlaps) > max_huge_assignments:
    pass  # don't unroll any involved blocks

and then we have the two parameters max_summands and max_huge_assignments, which are probably not easy to choose. And counting the number of overlaps for large matrices is probably also not cheap.
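
Spelled out a bit more, the check could look like the following self-contained sketch (the function name decide_unroll, the blocks argument and the default thresholds are all placeholders, not FFC code):

    # Sketch of the overlap-counting heuristic; names and thresholds are placeholders.
    from collections import Counter

    def decide_unroll(blocks, max_summands=8, max_huge_assignments=64):
        """blocks: iterable of iterables of (i, j) cell-matrix entries each block writes."""
        overlaps = Counter()  # a_ij -> number of overlapping blocks at a_ij
        for entries in blocks:
            overlaps.update(entries)
        huge = [a_ij for a_ij, count in overlaps.items() if count > max_summands]
        return len(huge) <= max_huge_assignments  # False -> don't unroll the involved blocks

    # Entry (0, 0) is hit by three blocks, exceeding max_summands=2:
    print(decide_unroll([[(0, 0), (0, 1)], [(0, 0)], [(0, 0), (1, 1)]],
                        max_summands=2, max_huge_assignments=0))  # False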

@garth-wells (Member)

Closing for now - can revisit later once major restructuring is done.
