
Generate assignment loops for large preintegration tables #25

Closed
wants to merge 31 commits

Conversation

@blechta (Member) commented May 3, 2018

Fixes https://bitbucket.org/fenics-project/ffc/issues/173/uflacs-preintegration-emits-inefficient

Fix for issue 173: GCC (<= 7.3.0) eats all available memory while optimizing the C++ code generated for an example with a 45x45 element tensor (mentioned in the issue). See the issue for more details.

Summary of changes: modified the UFLACS code generation to output loops instead of linear assignments for forms with large preintegration tables (> 1024 entries).
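
For context, the difference between the two code shapes can be sketched in a few lines of Python. This is a minimal, self-contained illustration only; the helper name generate_assignment_code, the table name PI0 and the factor f are made up and are not FFC's actual API:

    # Illustrative sketch: emit unrolled assignments for small tables,
    # a compact double loop for large ones (names are made up).
    MAX_UNROLLED_TABLE_SIZE = 1024

    def generate_assignment_code(table_name, rows, cols):
        if rows * cols <= MAX_UNROLLED_TABLE_SIZE:
            # Small table: one linear assignment per entry (previous behaviour)
            return "\n".join(
                f"A[{i * cols + j}] += f * {table_name}[{i}][{j}];"
                for i in range(rows) for j in range(cols))
        # Large table: nested loops indexing the static table instead
        return (f"for (int i = 0; i < {rows}; ++i)\n"
                f"  for (int j = 0; j < {cols}; ++j)\n"
                f"    A[i * {cols} + j] += f * {table_name}[i][j];")

    print(generate_assignment_code("PI0", 45, 45))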

@garth-wells
Copy link
Member

FFCX is much stricter than FFC with the flake8 tests.

@w1th0utnam3 (Contributor)

I'm a little confused about some flake8 errors: "undefined name 'block_rank'" on lines 985 and 988,

for i in range(block_rank))

In line 981:
# block_rank = len(blockmap)

the definition of the variable is indeed commented out. In my original commit this was not the case; did this happen accidentally?

@garth-wells (Member) commented May 13, 2018

Possibly - feel free to uncomment it.

FFC was in very poor shape regarding static code checking, so some aggressive linting was needed - more aggressive than one would normally wish.

@w1th0utnam3 (Contributor)

No problem.
See #28

@garth-wells (Member)

@blechta Could you look at fixing the merge conflicts, or close the PR?

@w1th0utnam3 (Contributor)

We discussed that it's probably more sensible to catch the issue that motivated this PR at an earlier stage, which might require far fewer code changes. However, I have not looked into this yet.

@w1th0utnam3 (Contributor)

Opened #38 for discussion

@blechta (Member, Author) commented Jul 6, 2018

I cleaned up the code a bit. There is now a parameter max_preintegrated_unrolled_table_size, and the code should allow mixing unrolling and loops, because unrolling and inlining is decided per block. Nevertheless, I still have to find an example where that happens.

@w1th0utnam3 (Contributor) commented Jul 6, 2018

The cleanup is nice.

I'll look at it again tomorrow, but there are currently two problems with the fix in general:

  1. The looped code currently uses L.MemZero, which maps to memset and was left over from C++. Currently this generates a compiler error because in C it can only be used for strings (https://en.cppreference.com/w/c/string/byte/memset); L.MemZero should be removed or replaced with something C compatible.
  2. Mixing unrolled/looped code currently does not make sense because, with either ordering, the unrolled and looped code might overwrite each other's values with zeros. With my original fix this did not occur because I always used only one of the modes for all blocks. The looped code tries to use L.MemZero, and the unrolled code might write zeros depending on the branches in generate_tensor_value_initialization.

Together with these fixes it makes sense to rethink the current loop assignment approach, as I mentioned in the issue (e.g. to use a DOF map for sparse-matrix-like assignment, like the premultiplied mode does).

@blechta (Member, Author) commented Jul 6, 2018

  1. memset should work: it writes zero bytes into memory. People say it might not work because a floating-point zero is not necessarily represented by all-zero memory, but this is theoretical; I am not aware of any implementation like that.

Did you confirm that it does not work by running the code? This is what tsfc generates (for the UFC interface; Firedrake kernels do not zero the tensor) and that runs fine in minidolfin, which is C, not C++.

  2. Mixing should be possible, although right now it looks buggy. What about running the unrolled code first on the whole tensor? That will zero the whole tensor and add the unrolled parts. Then you add the loops, which will just add to the tensor.
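
A tiny sketch of that ordering in plain Python (not FFC-generated code; the entry values are made up, only the assign-then-accumulate ordering matters):

    # Phase 1 (unrolled code) initializes the whole tensor, including zeros;
    # phase 2 (looped code) only accumulates into it.
    n = 6
    A = [0.0] * n                      # element tensor, zeroed up front
    unrolled = {0: 1.5, 2: -0.5}       # entries written by the unrolled blocks
    looped = {2: 2.0, 4: 1.0}          # entries added by the looped blocks

    for i, value in unrolled.items():  # unrolled code first: plain assignment
        A[i] = value
    for i, value in looped.items():    # looped code second: accumulate only
        A[i] += value

    print(A)  # [1.5, 0.0, 1.5, 0.0, 1.0, 0.0]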

@blechta (Member, Author) commented Jul 6, 2018

I pushed the fix for 2., but it is not tested. I think it's pretty straightforward now. memset is not needed anymore, although I believe it works.

@w1th0utnam3 (Contributor)

Sounds good, I'll check tomorrow.
Ok, you're right, I misread something about memset. But I did get a compile error because of the call. I can have a look tomorrow, but as you say, it's not needed anymore with this approach.

@w1th0utnam3 (Contributor) commented Jul 6, 2018

Ah, memset is actually missing the zero as an argument; only two arguments are passed in. (But I would still just replace it with a for loop.)

Commit: This fixes the mixed unrolled/looped case but the code is difficult to follow. Refactor!
@blechta (Member, Author) commented Jul 6, 2018

python3 -mffc -fmax_preintegrated_unrolled_table_size=12 NodalMini.ufl gives the mixed case. Looks good 😄

@w1th0utnam3 (Contributor)

Nice. I guess the results look good, the generated code probably less so ;)

@w1th0utnam3 (Contributor)

@blechta removed some more code: #40

Even less code duplication for preintegrated blocks
# Index the static preintegrated table:
P_ii = P_entity_indices + P_arg_indices
A_rhs = f * PI[P_ii]
if blockdata.unroll:
@blechta (Member, Author)

So we could just have if blockdata.unroll: continue at the beginning of the loop?

@w1th0utnam3 (Contributor)

Yes, good catch.

@blechta (Member, Author) commented Jul 8, 2018

We could test with minidolfin how the GCC behaviour depends on max_preintegrated_unrolled_table_size and/or the GCC optimization flags. Do you plan on opening a minidolfin PR?

@w1th0utnam3 (Contributor)

I thought about opening a PR at least for the FFC support, if it's ok that it targets FFC-X?

@blechta (Member, Author) commented Jul 8, 2018

Definitely FFC-X. (You could also open a WIP PR with the cross-cell parallelization. It would be good to see how disruptive the change is.)

@w1th0utnam3 (Contributor) commented Jul 20, 2018

Actually, I did not find an element with preintegration table sizes between 35^2 (e.g. CG_7 in 2D, which works fine with GCC) and 44^2 (e.g. NED^1_3 in 3D or CG_8 in 2D, which cause the issue). Do you have any ideas @blechta? Otherwise one could just use max_preintegrated_unrolled_table_size = 44**2 - 1 for lack of other examples.

@w1th0utnam3 mentioned this pull request Jul 20, 2018
@blechta (Member, Author) commented Jul 26, 2018

Should we rather be on the safe side, i.e., try to prevent the catastrophe for the cases where we don't know how they behave, and set the lower value? What about a quite arbitrary 1536?

A mass matrix on VectorElement("P", cell, 2, dim=x) could give sufficient granularity to test...
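
In (legacy) UFL syntax that suggestion would look roughly like this (a sketch only; the cell and the dim value are arbitrary):

    # Sketch of the suggested test form; dim is arbitrary and would be
    # varied to scale the element tensor size.
    from ufl import VectorElement, TrialFunction, TestFunction, inner, dx, triangle

    element = VectorElement("P", triangle, 2, dim=7)
    u = TrialFunction(element)
    v = TestFunction(element)
    a = inner(u, v) * dx  # mass matrix form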

@w1th0utnam3 (Contributor) commented Jul 26, 2018

Ah, good idea, I'll try using the vector element.

@w1th0utnam3 (Contributor) commented Jul 28, 2018

@blechta I think the vector element dimension only affects the element matrix size, not the PI table size (probably the same PI table is reused for every component?).
As an alternative to your arbitrary choice: use 35**2 + 1 to be on the safe side?

@blechta (Member, Author) commented Jul 29, 2018

@w1th0utnam3 Yeah, makes sense, the tables are reused. One could go with P1 + P2 + P3 + ... to get some granularity. But 35**2 + 1 is fine with me.

@w1th0utnam3 (Contributor) commented Jul 29, 2018

Ok, I have to experiment a little bit more. Your last suggestion works and is useful for generating PI tables of arbitrary size. However, it turns out that this isn't really the problem. E.g. for a mass matrix with Lagrange basis functions, a single huge PI table is generated; in the generated code every line contains only a single assignment, and GCC can compile it without a problem. In contrast, for the curl-curl problem with Nedelec elements, the problem is that there are 45 PI tables of size 44x44 each, which results in huge assignment statements.
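
To make "huge assignment statements" concrete, a purely illustrative snippet (the weight and table names w0, PI0, ... are made up) that prints the shape of one such generated line:

    # Purely illustrative: one generated assignment where many preintegrated
    # tables contribute to the same element-tensor entry.
    n_tables = 45
    rhs = " + ".join(f"w{k} * PI{k}[0][0]" for k in range(n_tables))
    print(f"A[0] = {rhs};")  # a single source line with 45 summands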

@blechta (Member, Author) commented Jul 29, 2018

Good digging. We have to check a bit more.

@blechta (Member, Author) commented Aug 4, 2018

So do I understand it correctly that long assignments are a problem? Do we know how to decide that at the IR stage?

@w1th0utnam3 (Contributor) commented Aug 4, 2018

All the information should be available, but I guess it's not super trivial to check. The problem occurs when there are many overlapping blocks for a single entry of the cell matrix. However, I don't know whether the issue already occurs with a single huge assignment or whether there have to be many of them. In general, the solution would probably go in the direction of counting the number of block overlaps per entry of the cell matrix, and then maybe something like

overlaps = {}  # a_ij -> number of overlapping blocks at a_ij
huge_overlaps = [a_ij for a_ij in overlaps if overlaps[a_ij] > max_summands]
if len(huge_overlaps) > max_huge_assignments:
    pass  # don't unroll any involved blocks

and then we have the two parameters max_summands and max_huge_assignments, which are probably not easy to choose. And counting the number of overlaps for large matrices is probably also not cheap.
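
Spelled out a bit more, the check could look like the following self-contained sketch (the function name decide_unroll, the blocks argument and the default thresholds are all placeholders, not FFC code):

    # Sketch of the overlap-counting heuristic; names and thresholds are placeholders.
    from collections import Counter

    def decide_unroll(blocks, max_summands=8, max_huge_assignments=64):
        """blocks: iterable of iterables of (i, j) cell-matrix entries each block writes."""
        overlaps = Counter()  # a_ij -> number of overlapping blocks at a_ij
        for entries in blocks:
            overlaps.update(entries)
        huge = [a_ij for a_ij, count in overlaps.items() if count > max_summands]
        return len(huge) <= max_huge_assignments  # False -> don't unroll the involved blocks

    # Entry (0, 0) is hit by three blocks, exceeding max_summands=2:
    print(decide_unroll([[(0, 0), (0, 1)], [(0, 0)], [(0, 0), (1, 1)]],
                        max_summands=2, max_huge_assignments=0))  # False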

@garth-wells (Member)

Closing for now - can revisit later once major restructuring is done.
