
Matrix-free for deformable solid #1

Closed
Rombur opened this issue Jul 26, 2018 · 16 comments

Rombur commented Jul 26, 2018

First, I have a general comment. Would it be possible to post, here or on the website, the milestone reports that are produced as part of ECP? Right now it is hard for people not affiliated with ECP to follow the developments made by CEED. While someone could, in theory, stalk all the repositories affiliated with CEED, it would be a lot easier to just read the milestone reports.

Second, I have a question about the limitations of matrix-free methods. I have some colleagues simulating the deformation of solid materials with complex constitutive laws. In this case, computing the material properties on the fly is extremely costly due to the complexity of these laws, which sometimes require the inversion of a small linear system. Yet precomputing these properties is not satisfactory either, because we can end up storing a fourth-order tensor (about 80 coefficients) per quadrature point. This means that the matrix-free implementation requires more memory than the matrix-based implementation. We could of course try to store some elements of the constitutive laws and compute the rest on the fly, but it is not obvious how to strike the right balance. Is anyone in CEED facing a similar problem?

cc: @bangerth, @tjhei, @kronbichler, @masterleinad, @davydden, @jppelteret

@jedbrown
Member

The first question is for @tzanio because there are private repos for internal/reporting purposes.

Doesn't that dim-4 tensor have some symmetries? The crossover point may be different if you have very heavy coefficients, but the matrix-free representation is asymptotically better as the approximation order increases (and in practice the crossover tends to be pretty low, like p=2 or p=3).
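
To make the symmetry point concrete, here is a small sketch (my addition, not from the thread) that counts the independent entries of a fourth-order tensor under the minor symmetries C_ijkl = C_jikl = C_ijlk and, optionally, the major symmetry C_ijkl = C_klij:

```python
from itertools import product

def count_independent_entries(dim=3, major=True):
    # Canonicalize each index tuple (i, j, k, l) under the minor symmetries
    # C_ijkl = C_jikl = C_ijlk and, if requested, the major symmetry
    # C_ijkl = C_klij, then count the distinct equivalence classes.
    reps = set()
    for i, j, k, l in product(range(dim), repeat=4):
        a, b = tuple(sorted((i, j))), tuple(sorted((k, l)))
        reps.add(tuple(sorted((a, b))) if major else (a, b))
    return len(reps)

print(count_independent_entries(3, major=False))  # 36 (6x6 Voigt matrix)
print(count_independent_entries(3, major=True))   # 21 (symmetric Voigt matrix)
```

With the minor symmetries alone the 3D tensor maps to a 6x6 Voigt matrix (36 entries, down from 81), and major symmetry makes that matrix symmetric (21 entries).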

@v-dobrev
Member

Here is a "back of the envelope" computation of the memory cost of matrix-free and sparse-matrix (CSR format) representations of the elasticity operator, produced by the Python script below:

======================================================================
elasticity operator action, dimension: 3

    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof
  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR
----+----+----------+----------+--------+-------+---------+----------
  1 |  2 |  2985984 |  9145874 |  69.78 |  7.84 | 1723.84 |    976.0
  2 |  3 |   373248 |  9145874 |  69.78 |  3.31 |  727.24 |   2319.9
----+----+----------+----------+--------+-------+---------+----------
  1 |  3 |  2985984 |  9145874 |  69.78 |  7.84 | 5743.51 |    976.0
  2 |  4 |   373248 |  9145874 |  69.78 |  3.31 | 1705.72 |   2319.9
  3 |  5 |   110592 |  9145874 |  69.78 |  2.32 |  988.74 |   4541.3
  4 |  6 |    46656 |  9145874 |  69.78 |  1.91 |  721.67 |   7860.7
  5 |  7 |    21952 |  8409662 |  64.16 |  1.69 |  586.95 |  12502.7
  6 |  8 |    13824 |  9145874 |  69.78 |  1.56 |  507.70 |  18675.4
  7 |  9 |     8000 |  8409662 |  64.16 |  1.46 |  455.22 |  26622.0
  8 | 10 |     5832 |  9145874 |  69.78 |  1.39 |  418.78 |  36527.9
  9 | 11 |     4096 |  9145874 |  69.78 |  1.34 |  391.64 |  48644.6
 10 | 12 |     2744 |  8409662 |  64.16 |  1.30 |  370.57 |  63210.0
 11 | 13 |     2197 |  8957951 |  68.34 |  1.27 |  354.25 |  80370.3
 12 | 14 |     1728 |  9145874 |  69.78 |  1.25 |  340.93 | 100402.6
 13 | 15 |     1331 |  8957951 |  68.34 |  1.22 |  329.84 | 123540.3
 14 | 16 |     1000 |  8409662 |  64.16 |  1.20 |  320.43 | 150023.8
 15 | 17 |      729 |  7546367 |  57.57 |  1.19 |  312.30 | 180104.5
 16 | 18 |      729 |  9145874 |  69.78 |  1.17 |  305.93 | 213596.7
======================================================================

Here, p is the polynomial order, q is the number of quadrature points in 1D, "B/dof MF" is the number of bytes per dof for the matrix-free representation, and "B/dof CSR" is the number of bytes per dof for the CSR matrix representation.

Unless I made a mistake, it turns out that the matrix-free representation is less memory efficient only for p=1.

Here is the Python script:

from numpy import *


size_int = 4
size_float = 8

def set_params_3d():
   dim = 3
   p1 = arange(1, 3)       # polynomial orders, q=p+1
   p2 = arange(1, 17)      # polynomial orders, q=p+2
   p = append(p1, p2)
   q = append(p1+1, p2+2)  # quadrature points in 1d
   qf = 3**4               # number of floating point values per quadrature point
   nd = 3e6                # approximate (scalar) number of dofs
   E = floor(nd**(1./3)/p)**3  # number of elements based on approx. num. of dofs
   n = (E**(1./3)*p+1)**3  # number of scalar dofs on an E^(1/3) x E^(1/3) x E^(1/3) mesh
   return dim, p1, p, q, qf, E, n

# table formatting:
fmt = '%3d | %2d |%9d |%9d |%7.2f |%6.2f |%8.2f |%9.1f'
row1 = '    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof'
row2 = '  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR'
sep = '----+----+----------+----------+--------+-------+---------+----------'
big_sep = '='*70

def print_table(t):
   print(row1)
   print(row2)
   print(sep)
   for i in range(len(t)):
      if i == len(p1):
         print(sep)
      print(fmt % tuple(t[i]))


# ----------------------------------------------------------
#    HEXES ELASTICITY
# ----------------------------------------------------------

def elast_hex_mf_bytes(p, q, qf, E):
   # element-to-dof array + quadrature point data
   return E*dim*(p+1)**3*size_int + (E*q**3)*qf*size_float

def elast_hex_csr_bytes(p, E):
   e = E**(1./3)
   # CSR nnz on a periodic (e x e x e) mesh
   # num vert dofs x num vert dof connections:
   nnz = dim*(e+1)**3 * dim*(2*p+1)**3
   # add num edge dofs x num edge dof connections:
   nnz = nnz + dim*(p-1)*(dim*e*(e+1)**2) * dim*(2*p+1)**2*(p+1)
   # add num face dofs x num face dof connections:
   nnz = nnz + dim*(p-1)**2*(dim*e**2*(e+1)) * dim*(2*p+1)*(p+1)**2
   # add num element dofs x num element dof connections:
   nnz = nnz + dim*(p-1)**3*(e**3) * dim*(p+1)**3
   # 1 int + 1 float for every nonzero entry + (mat_size+1) ints
   return nnz*(size_int+size_float) + (dim*n+1)*size_int


dim, p1, p, q, qf, E, n = set_params_3d()
print(big_sep)
print('elasticity operator action, dimension:', dim)
c5 = dim*n*size_float/2.**20                     # T-vector size in MiB
c6 = E*(p+1)**3/(1.*n)                           # E-vector size / T-vector size
c7 = elast_hex_mf_bytes(p, q, qf, E)/(1.*dim*n)  # MF bytes per dof
c8 = elast_hex_csr_bytes(p, E)/(1.*dim*n)        # CSR bytes per dof

t = array([p, q, E, dim*n, c5, c6, c7, c8]).transpose()
print()
print_table(t)

print(big_sep)


tzanio commented Jul 27, 2018

Hi Bruno,

Regarding your first question, we will discuss internally if everyone is comfortable with making the milestone reports public and get back to you. We do try to highlight the major developments on our News page, http://ceed.exascaleproject.org/news/, but I agree that the technical content in the reports can be interesting and valuable to an external audience.

Cheers,
Tzanio


Rombur commented Jul 27, 2018

@v-dobrev Thanks for the estimation and for the script! This is actually much better than what we expected. My colleagues said that they will try to use matrix-free methods in their application. We will keep you posted. Thanks again.

@tzanio Thanks.

@jedbrown
Member

This figure plots cost estimates with a common symmetry assumption.

@v-dobrev
Member

Here is a more realistic version of the CSR storage computation, where the sparsity pattern of one scalar component is reused for all nine blocks of the 3 x 3 block structure:

======================================================================
elasticity operator action, dimension: 3

    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof
  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR
----+----+----------+----------+--------+-------+---------+----------
  1 |  2 |  2985984 |  9145874 |  69.78 |  7.84 | 1723.84 |    675.9
  2 |  3 |   373248 |  9145874 |  69.78 |  3.31 |  727.24 |   1597.6
----+----+----------+----------+--------+-------+---------+----------
  1 |  3 |  2985984 |  9145874 |  69.78 |  7.84 | 5743.51 |    675.9
  2 |  4 |   373248 |  9145874 |  69.78 |  3.31 | 1705.72 |   1597.6
  3 |  5 |   110592 |  9145874 |  69.78 |  2.32 |  988.74 |   3115.9
  4 |  6 |    46656 |  9145874 |  69.78 |  1.91 |  721.67 |   5379.5
  5 |  7 |    21952 |  8409662 |  64.16 |  1.69 |  586.95 |   8533.2
  6 |  8 |    13824 |  9145874 |  69.78 |  1.56 |  507.70 |  12738.6
  7 |  9 |     8000 |  8409662 |  64.16 |  1.46 |  455.22 |  18122.3
  8 | 10 |     5832 |  9145874 |  69.78 |  1.39 |  418.78 |  24865.9
  9 | 11 |     4096 |  9145874 |  69.78 |  1.34 |  391.64 |  33089.8
 10 | 12 |     2744 |  8409662 |  64.16 |  1.30 |  370.57 |  42929.1
 11 | 13 |     2197 |  8957951 |  68.34 |  1.27 |  354.25 |  54595.2
 12 | 14 |     1728 |  9145874 |  69.78 |  1.25 |  340.93 |  68189.0
 13 | 15 |     1331 |  8957951 |  68.34 |  1.22 |  329.84 |  83849.6
 14 | 16 |     1000 |  8409662 |  64.16 |  1.20 |  320.43 | 101710.6
 15 | 17 |      729 |  7546367 |  57.57 |  1.19 |  312.30 | 121897.8
 16 | 18 |      729 |  9145874 |  69.78 |  1.17 |  305.93 | 144877.1
======================================================================

Note that even the cheapest CSR case (p=1) uses 675.9 bytes/dof, while moderate-order MF (p=5) needs less: 586.95 bytes/dof. If you go to order p=12, MF is about 2x cheaper in terms of memory.

Here is the updated Python script:

from numpy import *


size_int = 4
size_float = 8

def set_params_3d():
   dim = 3
   p1 = arange(1, 3)       # polynomial orders, q=p+1
   p2 = arange(1, 17)      # polynomial orders, q=p+2
   p = append(p1, p2)
   q = append(p1+1, p2+2)  # quadrature points in 1d
   qf = 3**4               # number of floating point values per quadrature point
   nd = 3e6                # approximate (scalar) number of dofs
   E = floor(nd**(1./3)/p)**3  # number of elements based on approx. num. of dofs
   n = (E**(1./3)*p+1)**3  # number of scalar dofs on an E^(1/3) x E^(1/3) x E^(1/3) mesh
   return dim, p1, p, q, qf, E, n

# table formatting:
fmt = '%3d | %2d |%9d |%9d |%7.2f |%6.2f |%8.2f |%9.1f'
row1 = '    |    |  num. of |  num. of |  T-vec | ratio |  B/dof  |   B/dof'
row2 = '  p |  q | elements |   dofs   |   MiB  | #E/#T |    MF   |    CSR'
sep = '----+----+----------+----------+--------+-------+---------+----------'
big_sep = '='*70

def print_table(t):
   print(row1)
   print(row2)
   print(sep)
   for i in range(len(t)):
      if i == len(p1):
         print(sep)
      print(fmt % tuple(t[i]))


# ----------------------------------------------------------
#    HEXES ELASTICITY
# ----------------------------------------------------------

def elast_hex_mf_bytes(p, q, qf, E):
   # element-to-dof array + quadrature point data
   return E*dim*(p+1)**3*size_int + (E*q**3)*qf*size_float

def elast_hex_csr_bytes(p, E, n):
   e = E**(1./3)
   #
   # CSR nnz on a (e x e x e) hex mesh
   # (one scalar component, full local matrices)
   #
   # dofs on vertices with one adjacent element:
   nnz = 2**3 * (p+1)**3
   # dofs on vertices with two adjacent elements:
   nnz = nnz + 12*(e-1) * (2*p+1)*(p+1)**2
   # dofs on vertices with four adjacent elements:
   nnz = nnz + 6*(e-1)**2 * (2*p+1)**2*(p+1)
   # dofs on vertices with eight adjacent elements:
   nnz = nnz + (e-1)**3 * (2*p+1)**3
   #
   # dofs on edges with one adjacent element:
   nnz = nnz + (p-1)*12*e * (p+1)**3
   # dofs on edges with two adjacent elements:
   nnz = nnz + (p-1)*6*2*e*(e-1) * (2*p+1)*(p+1)**2
   # dofs on edges with four adjacent elements:
   nnz = nnz + (p-1)*3*e*(e-1)**2 * (2*p+1)**2*(p+1)
   #
   # dofs on faces with one adjacent element:
   nnz = nnz + (p-1)**2*6*e**2 * (p+1)**3
   # dofs on faces with two adjacent elements:
   nnz = nnz + (p-1)**2*3*e**2*(e-1) * (2*p+1)*(p+1)**2
   #
   # dofs inside elements:
   nnz = nnz + (p-1)**3*e**3 * (p+1)**3
   #
   # 1 int + 1 float for every nonzero entry + (mat_size+1) ints:
   # return 9*nnz*(size_int+size_float) + (3*n+1)*size_int
   #
   # reuse the scalar sparsity pattern for all 9 blocks:
   return 9*nnz*size_float + nnz*size_int + (n+1)*size_int
   #
   # ignoring any sparsity pattern storage:
   # return 9*nnz*size_float


dim, p1, p, q, qf, E, n = set_params_3d()
print(big_sep)
print('elasticity operator action, dimension:', dim)
c5 = dim*n*size_float/2.**20                     # T-vector size in MiB
c6 = E*(p+1)**3/(1.*n)                           # E-vector size / T-vector size
c7 = elast_hex_mf_bytes(p, q, qf, E)/(1.*dim*n)  # MF bytes per dof
c8 = elast_hex_csr_bytes(p, E, n)/(1.*dim*n)     # CSR bytes per dof

t = array([p, q, E, dim*n, c5, c6, c7, c8]).transpose()
print()
print_table(t)

print(big_sep)

@jppelteret

> Thanks for the estimation and for the script! This is actually much better than what we expected. They said that they will try to use matrix-free in their application. We will keep you posted. Thanks again.

I just wanted to echo @Rombur's thanks to both @v-dobrev and @jedbrown. This led to an interesting discussion on our side, and as @Rombur said we'll be investigating this more and will report back when we have some results to share.

@v-dobrev
Member

You are welcome! Glad this is helpful.


tzanio commented Sep 21, 2018

The CEED reports are now available here:

http://ceed.exascaleproject.org/pubs/#ceed-reports


jedbrown commented Apr 9, 2020

@Rombur @jppelteret With regard to solid mechanics and efficient storage for the matrix-free Jacobian, please check out the discussion in this section of the libCEED docs, including the note on storage complexity below. https://libceed.readthedocs.io/en/latest/examples/solids/#id3
(You can also check out the implementation. This example also has matrix-free multigrid.)

@masterleinad

We finally investigated different caching strategies in https://onlinelibrary.wiley.com/doi/abs/10.1002/nme.6336. For the example considered there, it was best to cache just a second-rank tensor, while caching the full fourth-rank tensor was slowest.
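
For a rough sense of scale, here is a quick byte count per quadrature point in 3D (my own back-of-the-envelope sketch, with illustrative strategy names, not figures or terminology from the paper):

```python
SIZE_FLOAT = 8  # bytes per double-precision value

# Floats cached per quadrature point in 3D for a few caching strategies
# (illustrative names; the actual strategies in the paper may differ):
strategies = {
    "recompute everything on the fly": 0,
    "symmetric second-rank tensor": 6,           # e.g. a stress-like quantity
    "fourth-rank tensor, minor symmetries": 36,  # 6x6 Voigt matrix
    "fourth-rank tensor, full": 3**4,            # 81 components
}

for name, nfloats in strategies.items():
    print("%-40s %3d floats = %4d bytes/qpt" % (name, nfloats, nfloats * SIZE_FLOAT))
```

The spread between 48 bytes/qpt for a symmetric second-rank tensor and 648 bytes/qpt for a full fourth-rank tensor illustrates why the caching choice matters so much for memory-bandwidth-bound operator application.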


jedbrown commented Apr 9, 2020

Nice paper, @masterleinad. It looks like Table 1, Algorithm 3 isn't exploiting symmetry in the fourth-order elasticity tensor C, which would make it 21 instead of 36 independent entries (in 3D). Our currently implemented formulation is similar to the discussion at the end of your appendix, but (I think) exploiting more symmetry. Fusing the metric transformation to the current configuration for Jacobian application (as in your Algorithm 2) is also clearly a good choice. It'll be interesting to make quantitative comparisons. Is your code available?

@masterleinad

The code can be found at https://github.com/davydden/large-strain-matrix-free/.

@jppelteret

Thanks @jedbrown for the link to the CEED documentation. That example and writeup look great! And thanks once more to you and @v-dobrev for the initial discussions on this topic. I was going to pass along the link to the paper, but @masterleinad seems to have beaten me to it :-)

> isn't exploiting symmetry in the fourth-order elasticity tensor C, which would make it 21 instead of 36 independent entries (in 3D)

This is the current implementation of symmetric tensors in deal.II (they lack major symmetry). Implementing a new tensor class was a little beyond the scope of the work, so we didn't consider it.

> Our currently-implemented formulation is similar to the discussion at the end of your appendix

Great that you got that covered already :-)

@kronbichler

@jedbrown I'd be very happy to look into your implementations as well; it's been an interesting read through your materials and the general overview. Let me know when you start looking into our algorithms, so I can dig deeper into yours and make cross-comparisons. In any case, a discussion between both teams might provide further insight into what we are doing, so 👍

@jedbrown
Member

Indeed! We'd like to refine a couple of things in our implementation and set up some good comparisons. We've been doing experiments with load increments (which offer lots of opportunities for nonlinear multilevel methods), but they make direct comparisons a bit tricky. We'll have a look at your code and see if we can set up a fair comparison.

One thing we noticed that made a significant impact for the stored-nothing Jacobian method was to replace log1p with a series expansion.
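
For illustration, a truncated Maclaurin series for log1p might look like the following sketch (my assumption of what such a replacement could look like; the actual expansion and truncation order used in the implementation may differ):

```python
import math

def log1p_series(x, terms=8):
    # Truncated Maclaurin series log(1+x) = x - x^2/2 + x^3/3 - ...
    # Accurate for |x| << 1 (e.g. nearly incompressible states where the
    # argument stays close to zero); plain polynomial arithmetic can
    # vectorize better than a library log1p call inside hot loops.
    s, xk = 0.0, x
    for k in range(1, terms + 1):
        s += xk / k if k % 2 == 1 else -xk / k
        xk *= x
    return s

x = 1e-3
print(abs(log1p_series(x) - math.log1p(x)))  # truncation error well below double precision
```

The trade-off is range: the series only converges for |x| < 1, so such a replacement is safe only when the argument is known to stay small.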

Our implementation will be ready for GPU comparisons once we can compute the diagonal of the Jacobian efficiently there.

Overall, I think the approaches that your group and ours are using should have significant impact on industrial practice, where assembled matrices (often with quadratic elements) are the norm.
