
Matrix-free for deformable solid #1

Closed
Rombur opened this issue Jul 26, 2018 · 16 comments

Rombur commented Jul 26, 2018

First, I have a general comment. Would it be possible to post, here or on the website, the milestone reports that are produced as part of ECP? Right now it is hard for people not affiliated with ECP to follow the developments made by CEED. While someone could, in theory, stalk all the repositories affiliated with CEED, it would be a lot easier to just read the milestone reports.

Second, I have a question about the limitations of matrix-free methods. I have some colleagues simulating the deformation of solid materials with complex constitutive laws. In this case, computing the material properties on the fly is extremely costly due to the complexity of these laws, which sometimes require the inversion of a small linear system. Yet precomputing these properties is not satisfactory either, because we can end up storing a fourth-order tensor (about 80 coefficients) per quadrature point. This means that the matrix-free implementation requires more memory than the matrix-based implementation. We could of course try to store some elements of the constitutive laws and compute the rest on the fly, but it is not obvious how to strike the right balance. Is anyone in CEED facing a similar problem?

cc: @bangerth, @tjhei, @kronbichler, @masterleinad, @davydden, @jppelteret

@jedbrown
Member

The first question is for @tzanio because there are private repos for internal/reporting purposes.

Doesn't that dim-4 tensor have some symmetries? The crossover point may be different if you have very heavy coefficients, but the matrix-free representation is asymptotically better as the approximation order increases (and in practice the crossover tends to be pretty low, like p=2 or p=3).
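
To make the symmetry point concrete, here is a small sketch (my addition, not from the thread) that counts the independent entries of a fourth-order tensor under the minor symmetries C_ijkl = C_jikl = C_ijlk and, optionally, the major symmetry C_ijkl = C_klij:

```python
from itertools import product

def count_independent_entries(dim=3, major=True):
    # Canonicalize each index tuple (i, j, k, l) under the minor symmetries
    # C_ijkl = C_jikl = C_ijlk and, if requested, the major symmetry
    # C_ijkl = C_klij, then count the distinct equivalence classes.
    reps = set()
    for i, j, k, l in product(range(dim), repeat=4):
        a, b = tuple(sorted((i, j))), tuple(sorted((k, l)))
        reps.add(tuple(sorted((a, b))) if major else (a, b))
    return len(reps)

print(count_independent_entries(3, major=False))  # 36 (6x6 Voigt matrix)
print(count_independent_entries(3, major=True))   # 21 (symmetric Voigt matrix)
```

With the minor symmetries alone the 3D tensor maps to a 6x6 Voigt matrix (36 entries, down from 81), and major symmetry makes that matrix symmetric (21 entries).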

@v-dobrev
Member

Here is a "back of the envelope" computation of the memory cost of matrix-free and sparse-matrix (CSR format) representations of the elasticity operator, produced by the Python script below:

======================================================================
elasticity operator action, dimension: 3

    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof
  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR
----+----+----------+----------+--------+-------+---------+----------
  1 |  2 |  2985984 |  9145874 |  69.78 |  7.84 | 1723.84 |    976.0
  2 |  3 |   373248 |  9145874 |  69.78 |  3.31 |  727.24 |   2319.9
----+----+----------+----------+--------+-------+---------+----------
  1 |  3 |  2985984 |  9145874 |  69.78 |  7.84 | 5743.51 |    976.0
  2 |  4 |   373248 |  9145874 |  69.78 |  3.31 | 1705.72 |   2319.9
  3 |  5 |   110592 |  9145874 |  69.78 |  2.32 |  988.74 |   4541.3
  4 |  6 |    46656 |  9145874 |  69.78 |  1.91 |  721.67 |   7860.7
  5 |  7 |    21952 |  8409662 |  64.16 |  1.69 |  586.95 |  12502.7
  6 |  8 |    13824 |  9145874 |  69.78 |  1.56 |  507.70 |  18675.4
  7 |  9 |     8000 |  8409662 |  64.16 |  1.46 |  455.22 |  26622.0
  8 | 10 |     5832 |  9145874 |  69.78 |  1.39 |  418.78 |  36527.9
  9 | 11 |     4096 |  9145874 |  69.78 |  1.34 |  391.64 |  48644.6
 10 | 12 |     2744 |  8409662 |  64.16 |  1.30 |  370.57 |  63210.0
 11 | 13 |     2197 |  8957951 |  68.34 |  1.27 |  354.25 |  80370.3
 12 | 14 |     1728 |  9145874 |  69.78 |  1.25 |  340.93 | 100402.6
 13 | 15 |     1331 |  8957951 |  68.34 |  1.22 |  329.84 | 123540.3
 14 | 16 |     1000 |  8409662 |  64.16 |  1.20 |  320.43 | 150023.8
 15 | 17 |      729 |  7546367 |  57.57 |  1.19 |  312.30 | 180104.5
 16 | 18 |      729 |  9145874 |  69.78 |  1.17 |  305.93 | 213596.7
======================================================================

Here, p is the polynomial order, q is the number of quadrature points in 1D, "B/dof MF" is the number of bytes per dof for the matrix-free representation, and "B/dof CSR" is the number of bytes per dof for the CSR matrix representation.

Unless I made a mistake, it turns out that the matrix-free representation is less memory efficient only for p=1.

Here is the Python script:

from numpy import *


size_int = 4
size_float = 8

def set_params_3d():
   dim = 3
   p1 = arange(1, 3)       # polynomial orders, q=p+1
   p2 = arange(1, 17)      # polynomial orders, q=p+2
   p = append(p1, p2)
   q = append(p1+1, p2+2)  # quadrature points in 1d
   qf = 3**4               # number of floating point values per quadrature point
   nd = 3e6                # approximate (scalar) number of dofs
   E = floor(nd**(1./3)/p)**3  # number of elements based on approx. num. of dofs
   n = (E**(1./3)*p+1)**3  # number of scalar dofs on an E^(1/3) x E^(1/3) x E^(1/3) mesh
   return dim, p1, p, q, qf, E, n

# table formatting:
fmt = '%3d | %2d |%9d |%9d |%7.2f |%6.2f |%8.2f |%9.1f'
row1 = '    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof'
row2 = '  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR'
sep = '----+----+----------+----------+--------+-------+---------+----------'
big_sep = '='*70

def print_table(t):
   print(row1)
   print(row2)
   print(sep)
   for i in range(len(t)):
      if i == len(p1):
         print(sep)
      print(fmt % tuple(t[i]))


# ----------------------------------------------------------
#    HEXES ELASTICITY
# ----------------------------------------------------------

def elast_hex_mf_bytes(p, q, qf, E):
   # element-to-dof array + quadrature point data
   return E*dim*(p+1)**3*size_int + (E*q**3)*qf*size_float

def elast_hex_csr_bytes(p, E):
   e = E**(1./3)
   # CSR nnz on a periodic (e x e x e) mesh
   # num vert dofs x num vert dof connections:
   nnz = dim*(e+1)**3 * dim*(2*p+1)**3
   # add num edge dofs x num edge dof connections:
   nnz = nnz + dim*(p-1)*(dim*e*(e+1)**2) * dim*(2*p+1)**2*(p+1)
   # add num face dofs x num face dof connections:
   nnz = nnz + dim*(p-1)**2*(dim*e**2*(e+1)) * dim*(2*p+1)*(p+1)**2
   # add num element dofs x num element dof connections:
   nnz = nnz + dim*(p-1)**3*(e**3) * dim*(p+1)**3
   # 1 int + 1 float for every nonzero entry + (mat_size+1) ints
   return nnz*(size_int+size_float) + (dim*n+1)*size_int


dim, p1, p, q, qf, E, n = set_params_3d()
print(big_sep)
print('elasticity operator action, dimension:', dim)
c5 = dim*n*size_float/2.**20                     # T-vector size in MiB
c6 = E*(p+1)**3/(1.*n)                           # E-vector size / T-vector size
c7 = elast_hex_mf_bytes(p, q, qf, E)/(1.*dim*n)  # MF bytes per dof
c8 = elast_hex_csr_bytes(p, E)/(1.*dim*n)        # CSR bytes per dof

t = array([p, q, E, dim*n, c5, c6, c7, c8]).transpose()
print()
print_table(t)

print(big_sep)


tzanio commented Jul 27, 2018

Hi Bruno,

Regarding your first question, we will discuss internally if everyone is comfortable with making the milestone reports public and get back to you. We do try to highlight the major developments on our News page, http://ceed.exascaleproject.org/news/, but I agree that the technical content in the reports can be interesting and valuable to an external audience.

Cheers,
Tzanio


Rombur commented Jul 27, 2018

@v-dobrev Thanks for the estimation and for the script! This is actually much better than what we expected. My colleagues said that they will try to use matrix-free methods in their application. We will keep you posted. Thanks again.

@tzanio Thanks.

@jedbrown
Member

This figure plots cost estimates with a common symmetry assumption.

@v-dobrev
Member

Here is a more realistic version of the CSR storage computation, where the sparsity pattern of one scalar component is reused for all nine blocks of the 3 x 3 block structure:

======================================================================
elasticity operator action, dimension: 3

    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof
  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR
----+----+----------+----------+--------+-------+---------+----------
  1 |  2 |  2985984 |  9145874 |  69.78 |  7.84 | 1723.84 |    675.9
  2 |  3 |   373248 |  9145874 |  69.78 |  3.31 |  727.24 |   1597.6
----+----+----------+----------+--------+-------+---------+----------
  1 |  3 |  2985984 |  9145874 |  69.78 |  7.84 | 5743.51 |    675.9
  2 |  4 |   373248 |  9145874 |  69.78 |  3.31 | 1705.72 |   1597.6
  3 |  5 |   110592 |  9145874 |  69.78 |  2.32 |  988.74 |   3115.9
  4 |  6 |    46656 |  9145874 |  69.78 |  1.91 |  721.67 |   5379.5
  5 |  7 |    21952 |  8409662 |  64.16 |  1.69 |  586.95 |   8533.2
  6 |  8 |    13824 |  9145874 |  69.78 |  1.56 |  507.70 |  12738.6
  7 |  9 |     8000 |  8409662 |  64.16 |  1.46 |  455.22 |  18122.3
  8 | 10 |     5832 |  9145874 |  69.78 |  1.39 |  418.78 |  24865.9
  9 | 11 |     4096 |  9145874 |  69.78 |  1.34 |  391.64 |  33089.8
 10 | 12 |     2744 |  8409662 |  64.16 |  1.30 |  370.57 |  42929.1
 11 | 13 |     2197 |  8957951 |  68.34 |  1.27 |  354.25 |  54595.2
 12 | 14 |     1728 |  9145874 |  69.78 |  1.25 |  340.93 |  68189.0
 13 | 15 |     1331 |  8957951 |  68.34 |  1.22 |  329.84 |  83849.6
 14 | 16 |     1000 |  8409662 |  64.16 |  1.20 |  320.43 | 101710.6
 15 | 17 |      729 |  7546367 |  57.57 |  1.19 |  312.30 | 121897.8
 16 | 18 |      729 |  9145874 |  69.78 |  1.17 |  305.93 | 144877.1
======================================================================

Note that even the cheapest CSR case (p=1) uses 675.9 bytes/dof, while moderate-order MF (p=5) needs less: 586.95 bytes/dof. If you go to order p=12, MF is about 2x cheaper in terms of memory.

Here is the updated Python script:

from numpy import *


size_int = 4
size_float = 8

def set_params_3d():
   dim = 3
   p1 = arange(1, 3)       # polynomial orders, q=p+1
   p2 = arange(1, 17)      # polynomial orders, q=p+2
   p = append(p1, p2)
   q = append(p1+1, p2+2)  # quadrature points in 1d
   qf = 3**4               # number of floating point values per quadrature point
   nd = 3e6                # approximate (scalar) number of dofs
   E = floor(nd**(1./3)/p)**3  # number of elements based on approx. num. of dofs
   n = (E**(1./3)*p+1)**3  # number of scalar dofs on an E^(1/3) x E^(1/3) x E^(1/3) mesh
   return dim, p1, p, q, qf, E, n

# table formatting:
fmt = '%3d | %2d |%9d |%9d |%7.2f |%6.2f |%8.2f |%9.1f'
row1 = '    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof'
row2 = '  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR'
sep = '----+----+----------+----------+--------+-------+---------+----------'
big_sep = '='*70

def print_table(t):
   print(row1)
   print(row2)
   print(sep)
   for i in range(len(t)):
      if i == len(p1):
         print(sep)
      print(fmt % tuple(t[i]))


# ----------------------------------------------------------
#    HEXES ELASTICITY
# ----------------------------------------------------------

def elast_hex_mf_bytes(p, q, qf, E):
   # element-to-dof array + quadrature point data
   return E*dim*(p+1)**3*size_int + (E*q**3)*qf*size_float

def elast_hex_csr_bytes(p, E, n):
   e = E**(1./3)
   #
   # CSR nnz on a (e x e x e) hex mesh
   # (one scalar component, full local matrices)
   #
   # dofs on vertices with one adjacent element:
   nnz = 2**3 * (p+1)**3
   # dofs on vertices with two adjacent elements:
   nnz = nnz + 12*(e-1) * (2*p+1)*(p+1)**2
   # dofs on vertices with four adjacent elements:
   nnz = nnz + 6*(e-1)**2 * (2*p+1)**2*(p+1)
   # dofs on vertices with eight adjacent elements:
   nnz = nnz + (e-1)**3 * (2*p+1)**3
   #
   # dofs on edges with one adjacent element:
   nnz = nnz + (p-1)*12*e * (p+1)**3
   # dofs on edges with two adjacent elements:
   nnz = nnz + (p-1)*6*2*e*(e-1) * (2*p+1)*(p+1)**2
   # dofs on edges with four adjacent elements:
   nnz = nnz + (p-1)*3*e*(e-1)**2 * (2*p+1)**2*(p+1)
   #
   # dofs on faces with one adjacent element:
   nnz = nnz + (p-1)**2*6*e**2 * (p+1)**3
   # dofs on faces with two adjacent elements:
   nnz = nnz + (p-1)**2*3*e**2*(e-1) * (2*p+1)*(p+1)**2
   #
   # dofs inside elements:
   nnz = nnz + (p-1)**3*e**3 * (p+1)**3
   #
   # 1 int + 1 float for every nonzero entry + (mat_size+1) ints:
   # return 9*nnz*(size_int+size_float) + (3*n+1)*size_int
   #
   # reuse the scalar sparsity pattern for all 9 blocks:
   return 9*nnz*size_float + nnz*size_int + (n+1)*size_int
   #
   # ignoring any sparsity pattern storage:
   # return 9*nnz*size_float


dim, p1, p, q, qf, E, n = set_params_3d()
print(big_sep)
print('elasticity operator action, dimension:', dim)
c5 = dim*n*size_float/2.**20                     # T-vector size in MiB
c6 = E*(p+1)**3/(1.*n)                           # E-vector size / T-vector size
c7 = elast_hex_mf_bytes(p, q, qf, E)/(1.*dim*n)  # MF bytes per dof
c8 = elast_hex_csr_bytes(p, E, n)/(1.*dim*n)     # CSR bytes per dof

t = array([p, q, E, dim*n, c5, c6, c7, c8]).transpose()
print()
print_table(t)

print(big_sep)

@jppelteret

> Thanks for the estimation and for the script! This is actually much better than what we expected. They said that they will try to use matrix-free in their application. We will keep you posted. Thanks again.

I just wanted to echo @Rombur's thanks to both @v-dobrev and @jedbrown. This led to an interesting discussion on our side, and as @Rombur said we'll be investigating this more and will report back when we have some results to share.

@v-dobrev
Member

You are welcome! Glad this is helpful.


tzanio commented Sep 21, 2018

The CEED reports are now available here:

http://ceed.exascaleproject.org/pubs/#ceed-reports


jedbrown commented Apr 9, 2020

@Rombur @jppelteret With regard to solid mechanics and efficient storage for the matrix-free Jacobian, please check out the discussion in this section of the libCEED docs, including the note on storage complexity below. https://libceed.readthedocs.io/en/latest/examples/solids/#id3
(You can also check out the implementation. This example also has matrix-free multigrid.)

@masterleinad

We finally investigated different caching strategies in https://onlinelibrary.wiley.com/doi/abs/10.1002/nme.6336. For the example considered there, it was best to cache just a second-rank tensor, while caching the full fourth-rank tensor was slowest.
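
For a rough sense of scale, here is a quick byte count per quadrature point in 3D (my own back-of-the-envelope sketch, with illustrative strategy names, not figures or terminology from the paper):

```python
SIZE_FLOAT = 8  # bytes per double-precision value

# Floats cached per quadrature point in 3D for a few caching strategies
# (illustrative names; the actual strategies in the paper may differ):
strategies = {
    "recompute everything on the fly": 0,
    "symmetric second-rank tensor": 6,           # e.g. a stress-like quantity
    "fourth-rank tensor, minor symmetries": 36,  # 6x6 Voigt matrix
    "fourth-rank tensor, full": 3**4,            # 81 components
}

for name, nfloats in strategies.items():
    print("%-40s %3d floats = %4d bytes/qpt" % (name, nfloats, nfloats * SIZE_FLOAT))
```

The spread between 48 bytes/qpt for a symmetric second-rank tensor and 648 bytes/qpt for a full fourth-rank tensor illustrates why the caching choice matters so much for memory-bandwidth-bound operator application.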


jedbrown commented Apr 9, 2020

Nice paper, @masterleinad. It looks like Table 1, Algorithm 3 isn't exploiting symmetry in the fourth-order elasticity tensor C, which would make it 21 instead of 36 independent entries (in 3D). Our currently implemented formulation is similar to the discussion at the end of your appendix, but (I think) exploiting more symmetry. Fusing the metric transformation to the current configuration for Jacobian application (as in your Algorithm 2) is also clearly a good choice. It'll be interesting to make quantitative comparisons. Is your code available?

@masterleinad

The code can be found at https://github.com/davydden/large-strain-matrix-free/.

@jppelteret

Thanks @jedbrown for the link to the CEED documentation. That example and writeup look great! And thanks once more to you and @v-dobrev for the initial discussions on this topic. I was going to pass along the link to the paper, but @masterleinad seems to have beaten me to it :-)

> isn't exploiting symmetry in the fourth-order elasticity tensor C, which would make it 21 instead of 36 independent entries (in 3D)

This is the current implementation of symmetric tensors in deal.II (they lack major symmetry). Implementing a new tensor class was a little beyond the scope of the work, so we didn't consider it.

> Our currently-implemented formulation is similar to the discussion at the end of your appendix

Great that you got that covered already :-)

@kronbichler

@jedbrown I'd be very happy to look into your implementations as well; it's been an interesting read through your materials and the general overview. Let me know when you start looking into our algorithms, so I can dig deeper into yours and make cross-comparisons. In any case, a discussion between both teams might provide further insight into what we are doing, so 👍

@jedbrown
Member

Indeed! We'd like to refine a couple of things in our implementation and set up some good comparisons. We've been doing experiments with load increments (which offer lots of opportunities for nonlinear multilevel methods), but they make direct comparisons a bit tricky. We'll have a look at your code and see if we can set up a fair comparison.

One thing we noticed that made a significant impact for the stored-nothing Jacobian method was to replace log1p with a series expansion.
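
For illustration, a truncated Maclaurin series for log1p might look like the following sketch (my assumption of what such a replacement could look like; the actual expansion and truncation order used in the implementation may differ):

```python
import math

def log1p_series(x, terms=8):
    # Truncated Maclaurin series log(1+x) = x - x^2/2 + x^3/3 - ...
    # Accurate for |x| << 1 (e.g. nearly incompressible states where the
    # argument stays close to zero); plain polynomial arithmetic can
    # vectorize better than a library log1p call inside hot loops.
    s, xk = 0.0, x
    for k in range(1, terms + 1):
        s += xk / k if k % 2 == 1 else -xk / k
        xk *= x
    return s

x = 1e-3
print(abs(log1p_series(x) - math.log1p(x)))  # truncation error well below double precision
```

The trade-off is range: the series only converges for |x| < 1, so such a replacement is safe only when the argument is known to stay small.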

Our implementation will be ready for GPU comparisons once we can compute the diagonal of the Jacobian efficiently there.

Overall, I think the approaches that your group and ours are using should have significant impact on industrial practice, where assembled matrices (often with quadratic elements) are the norm.
