ECP 5: Deploy production sliding mesh capability with linear solver benchmarking #5

Closed
spdomin opened this issue Oct 7, 2016 · 13 comments
@spdomin
Contributor

spdomin commented Oct 7, 2016

Activities:

  1. Improve baseline sliding mesh capability at curved surfaces.
  2. Evaluate ATDM-based parallel search methods.
  3. Establish matrix set-up cost timings.
  4. Evaluate possible lagging of matrix update.
  5. Evaluate reduction of matrix system by omitting moving block column entries in favor of multiple matrix assembly/solve iterations.
@spdomin spdomin added this to the FY18Q1 milestone Oct 7, 2016
@spdomin
Contributor Author

spdomin commented Oct 7, 2016

@srajama1 I am working on obtaining a patch for the new ATDM-based search. Once I have that, you can help out on establishing search efficiencies.

@srajama1

srajama1 commented Oct 7, 2016

Thanks, I was talking to Nate about it. Let me know how I can help.

@spdomin
Contributor Author

spdomin commented Oct 7, 2016

@mbarone81 let's add your overset work to this as well with the hope that this milestone will define the path forward for blade motion.

@spdomin
Contributor Author

spdomin commented Oct 11, 2016

@alanw0, take a look at commit dbd1b958a52f82b0d3209ccb4b4d7c621016e62d for a new test to start profiling for NonConformalManager ghosting costs. This should replace the effort on edgeContact3D.

@alanw0
Contributor

alanw0 commented Oct 11, 2016

ok got it, I'll take a look at the dgNonConformalEdgeCylinder test.

@spdomin
Contributor Author

spdomin commented Oct 17, 2016

@srajama1, could you please keep track of the ATDM-based search and test it once it is confirmed that point/box has been deployed? I need to start working on the kokkos algorithm structure task. Thanks.

@spdomin
Contributor Author

spdomin commented Oct 19, 2016

@NaluCFD/sliding I have the higher order DG scheme working. It also naturally allows for the P=1/P=2 interface. I will perform some more P=2 sliding mesh sims and commit soon.

@spdomin
Contributor Author

spdomin commented Oct 19, 2016

Hex8/Hex27 or Hex27/Hex27 is now completed: commit d2adbe8

@spdomin
Contributor Author

spdomin commented Nov 22, 2016

@NaluCFD/sliding, here is a sample timing for a 150 million element 1024 job (run 100 steps with two Picard loops).

32 nodes (36 cores per node):

*******************************************************
Simulation Shall Complete: time/timestep: 0.0102493/100
*******************************************************
-------------------------------- 
Begin Timer Overview for Realm: realm_1
-------------------------------- 
Timing for Eq: myLowMach
             init --   	avg: 0.000184042 	min: 4.22001e-05 	max: 0.00332427
         assemble --   	avg: 0 	min: 0 	max: 0
    load_complete --   	avg: 0 	min: 0 	max: 0
            solve --   	avg: 0 	min: 0 	max: 0
    precond setup --   	avg: 0 	min: 0 	max: 0
             misc --   	avg: 18.7294 	min: 15.4274 	max: 19.9338
Timing for Eq: MomentumEQS
             init --   	avg: 431.72 	min: 428.433 	max: 451.232
         assemble --   	avg: 482.962 	min: 465.899 	max: 576.557
    load_complete --   	avg: 123.853 	min: 28.7052 	max: 133.94
            solve --   	avg: 583.758 	min: 583.597 	max: 589.672
    precond setup --   	avg: 0.0177482 	min: 0.011241 	max: 0.0544529
             misc --   	avg: 67.2906 	min: 66.379 	max: 68.3661
linear iterations --  	avg: 11.79 	min: 7 	max: 34
Timing for Eq: ContinuityEQS
             init --   	avg: 201.045 	min: 183.683 	max: 204.34
         assemble --   	avg: 151.122 	min: 142.574 	max: 177.107
    load_complete --   	avg: 30.9594 	min: 4.86764 	max: 33.8108
            solve --   	avg: 3142.5 	min: 3142.45 	max: 3155.93
    precond setup --   	avg: 22.3307 	min: 22.329 	max: 22.3356
             misc --   	avg: 97.5644 	min: 83.7688 	max: 98.3748
linear iterations --  	avg: 38.11 	min: 27 	max: 50
Timing for Eq: myZ
             init --   	avg: 190.661 	min: 190.292 	max: 191.43
         assemble --   	avg: 179.576 	min: 160.99 	max: 204.665
    load_complete --   	avg: 30.0869 	min: 4.99243 	max: 32.84
            solve --   	avg: 58.9669 	min: 58.8958 	max: 68.3101
    precond setup --   	avg: 0.00417599 	min: 0.00237584 	max: 0.0218868
             misc --   	avg: 18.8389 	min: 18.4308 	max: 19.505
linear iterations --  	avg: 8.28 	min: 6 	max: 10
Timing for IO: 
   io create mesh --   	avg: 0.363296 	min: 0.191619 	max: 0.527161
 io output fields --   	avg: 57.5503 	min: 56.8373 	max: 58.4367
 io populate mesh --   	avg: 4.6819 	min: 4.6608 	max: 4.70148
 io populate fd   --   	avg: 0.256733 	min: 0.0831389 	max: 0.430451
Timing for connectivity/finalize lysys: 
         eqs init --   	avg: 823.427 	min: 820.799 	max: 827.33
Timing for property evaluation:         
            props --   	avg: 0.0918778 	min: 0.0545573 	max: 0.310776
Timing for Contact: 
       contact bc --   	avg: 15.1264 	min: 14.6959 	max: 18.4114

Timing for Simulation: nprocs= 1152
           main() --   	avg: 5880.26 	min: 5840.17 	max: 5887.04
Memory Overview: 
nalu memory: total (over all cores) current/high-water mark=       513.083 G      536.876 G
nalu memory:   min (over all cores) current/high-water mark=       256.641 M      266.148 M
nalu memory:   max (over all cores) current/high-water mark=       1.89328 G      2.04586 G
Min High-water memory usage 266.1 MB
Avg High-water memory usage 477.2 MB
Max High-water memory usage 2095.0 MB

Min Available memory per processor 1789.2 MB
Avg Available memory per processor 1789.2 MB
Max Available memory per processor 1789.2 MB

Min No-output time 5787.6 sec
Avg No-output time 5829.7 sec
Max No-output time 5833.2 sec

STKPERF: Total Time: 5841.7

STKPERF: Current memory: 357113856 (340.6 M)
STKPERF: Memory high water: 374874112 (357.5 M)

64 nodes (36 cores per node):

*******************************************************
Simulation Shall Complete: time/timestep: 0.0102493/100
*******************************************************
-------------------------------- 
Begin Timer Overview for Realm: realm_1
-------------------------------- 
Timing for Eq: myLowMach
             init --   	avg: 9.29431e-05 	min: 3.31402e-05 	max: 0.000857592
         assemble --   	avg: 0 	min: 0 	max: 0
    load_complete --   	avg: 0 	min: 0 	max: 0
            solve --   	avg: 0 	min: 0 	max: 0
    precond setup --   	avg: 0 	min: 0 	max: 0
             misc --   	avg: 10.332 	min: 7.72162 	max: 11.2043
Timing for Eq: MomentumEQS
             init --   	avg: 239.982 	min: 237.399 	max: 253.78
         assemble --   	avg: 240.033 	min: 231.406 	max: 314.129
    load_complete --   	avg: 97.0818 	min: 21.3585 	max: 102.162
            solve --   	avg: 330.231 	min: 330.093 	max: 330.599
    precond setup --   	avg: 0.00849527 	min: 0.00510311 	max: 0.0406508
             misc --   	avg: 34.181 	min: 33.6794 	max: 34.9966
linear iterations --  	avg: 12.285 	min: 7 	max: 34
Timing for Eq: ContinuityEQS
             init --   	avg: 119.214 	min: 106.893 	max: 121.829
         assemble --   	avg: 72.407 	min: 70.6553 	max: 93.4898
    load_complete --   	avg: 24.731 	min: 3.5701 	max: 26.1621
            solve --   	avg: 1910.76 	min: 1910.73 	max: 1911.08
    precond setup --   	avg: 12.9936 	min: 12.9926 	max: 12.9988
             misc --   	avg: 44.6586 	min: 44.1545 	max: 45.364
linear iterations --  	avg: 42.01 	min: 32 	max: 50
Timing for Eq: myZ
             init --   	avg: 108.232 	min: 107.934 	max: 108.653
         assemble --   	avg: 81.3621 	min: 79.6523 	max: 101.46
    load_complete --   	avg: 23.6941 	min: 3.58118 	max: 25.093
            solve --   	avg: 35.8702 	min: 35.8191 	max: 36.088
    precond setup --   	avg: 0.00200497 	min: 0.00114703 	max: 0.0113389
             misc --   	avg: 9.74541 	min: 9.52759 	max: 10.3505
linear iterations --  	avg: 9.445 	min: 6 	max: 10
Timing for IO: 
   io create mesh --   	avg: 0.748922 	min: 0.388414 	max: 0.995598
 io output fields --   	avg: 26.2067 	min: 25.7432 	max: 26.8175
 io populate mesh --   	avg: 4.66858 	min: 4.6314 	max: 4.70515
 io populate fd   --   	avg: 0.406266 	min: 0.152544 	max: 0.774392
Timing for connectivity/finalize lysys: 
         eqs init --   	avg: 467.428 	min: 465.682 	max: 469.464
Timing for property evaluation:         
            props --   	avg: 0.0555738 	min: 0.0349991 	max: 0.15305
Timing for Contact: 
       contact bc --   	avg: 11.7654 	min: 11.5483 	max: 14.2255

Timing for Simulation: nprocs= 2304
           main() --   	avg: 3422.38 	min: 3401.18 	max: 3425.24
Memory Overview: 
nalu memory: total (over all cores) current/high-water mark=       645.294 G      674.343 G
nalu memory:   min (over all cores) current/high-water mark=       185.172 M      193.027 M
nalu memory:   max (over all cores) current/high-water mark=       1.07852 G      1.14824 G
Min High-water memory usage 193.0 MB
Avg High-water memory usage 299.7 MB
Max High-water memory usage 1175.8 MB

Min Available memory per processor 1789.2 MB
Avg Available memory per processor 1789.2 MB
Max Available memory per processor 1789.2 MB

Min No-output time 3396.1 sec
Avg No-output time 3398.5 sec
Max No-output time 3401.0 sec

STKPERF: Total Time: 3420.3
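
As a rough way to compare the two runs, the sketch below computes strong-scaling speedup and parallel efficiency from the average `main()` and ContinuityEQS `solve` times reported above. The function name `strong_scaling` is hypothetical and only for illustration. It assumes the two runs use the same mesh and step count, as the logs suggest.

```python
def strong_scaling(t_base, p_base, t_scaled, p_scaled):
    """Return (speedup, parallel efficiency) of the scaled run relative to the base run."""
    speedup = t_base / t_scaled
    ideal_speedup = p_scaled / p_base
    return speedup, speedup / ideal_speedup


# Average wall times in seconds, copied from the two timer overviews above.
# Process counts are the reported nprocs values (1152 and 2304).
main_speedup, main_eff = strong_scaling(5880.26, 1152, 3422.38, 2304)
cont_speedup, cont_eff = strong_scaling(3142.5, 1152, 1910.76, 2304)

print(f"main():              speedup={main_speedup:.3f} efficiency={main_eff:.3f}")
print(f"ContinuityEQS solve: speedup={cont_speedup:.3f} efficiency={cont_eff:.3f}")
```

An efficiency of 1.0 would mean the time halves when the process count doubles. Values below that show how much is lost at the larger process count.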

@alanw0
Contributor

alanw0 commented Nov 22, 2016

It's interesting to look at the details of the timings, particularly the difference between min and max for a given line, which indicates imbalance. It's hard to say, though, whether it's an imbalance of elements, of work (e.g. localized work like search/contact), or of ownership of shared nodes. The last would affect linear-solver work, since the number of owned nodes tends to correspond to the number of matrix rows per proc.

In these timings the assemble looks pretty well balanced, which may indicate the elements are well balanced. The solve time looks balanced, but that could be because it includes sync points (like dots/norms), which force the overall solve time to appear balanced. The load-complete time is distinctly imbalanced, which may be the most direct symptom of an imbalance among shared nodes causing uneven numbers of matrix rows per proc.
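
One way to put numbers on the min/max comparison is the relative spread, `(max - min) / avg`, per timer line. The sketch below parses lines in the format of the timer overview and applies it to the 32-node MomentumEQS entries. The regular expression, the helper name `relative_spread`, and the choice of metric are assumptions made for illustration, not anything Nalu provides.

```python
import re

# Matches lines such as "  load_complete --   avg: 123.853 min: 28.7052 max: 133.94".
TIMER_LINE = re.compile(
    r"^\s*(?P<name>.+?)\s+--\s+avg:\s*(?P<avg>\S+)\s+min:\s*(?P<min>\S+)\s+max:\s*(?P<max>\S+)"
)


def relative_spread(line):
    """Return (timer name, (max - min) / avg), or None if the line does not apply."""
    match = TIMER_LINE.match(line)
    if match is None:
        return None
    avg, lo, hi = (float(match.group(key)) for key in ("avg", "min", "max"))
    if avg == 0.0:
        return None  # timers that never ran carry no balance information
    return match.group("name"), (hi - lo) / avg


# MomentumEQS lines from the 32-node run above.
sample = """\
         assemble --   avg: 482.962 min: 465.899 max: 576.557
    load_complete --   avg: 123.853 min: 28.7052 max: 133.94
            solve --   avg: 583.758 min: 583.597 max: 589.672
"""

spreads = dict(filter(None, map(relative_spread, sample.splitlines())))
for name, spread in spreads.items():
    print(f"{name:>15s}: {spread:.3f}")
```

By this measure load_complete has by far the largest spread of the three, and solve the smallest. A small spread on solve does not rule out imbalance, for the sync-point reason given above.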

@spdomin
Contributor Author

spdomin commented Nov 23, 2016

Exactly. This is a hybrid mesh. In general, for these types of meshes we find almost perfect elemental balance while the node balance is generally poor. Aero found this as well and changed the manner by which node ownership is processed (round robin rather than lowest rank). We can probably consider something similar to make sure that the rows are well balanced.
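
To illustrate why the ownership rule matters, here is a toy sketch comparing the two policies on a 1D chain of four ranks, where each neighboring pair shares four nodes. It is not Nalu, STK, or Aero code. The `round_robin` rule (cycle through the sharing ranks by node id) is only one assumed reading of "round robin", and all names are hypothetical.

```python
def owned_counts(shared_nodes, num_ranks, pick_owner):
    """Count how many shared nodes each rank owns under the given policy."""
    counts = [0] * num_ranks
    for node_id, ranks in shared_nodes.items():
        counts[pick_owner(node_id, sorted(ranks))] += 1
    return counts


def lowest_rank(node_id, ranks):
    """Owner is the lowest-numbered rank that shares the node."""
    return ranks[0]


def round_robin(node_id, ranks):
    """Assumed variant: rotate ownership among the sharing ranks by node id."""
    return ranks[node_id % len(ranks)]


# Toy decomposition: ranks 0..3 in a chain, 4 shared nodes per interface.
num_ranks = 4
nodes_per_interface = 4
shared_nodes = {
    i: (i // nodes_per_interface, i // nodes_per_interface + 1)
    for i in range(nodes_per_interface * (num_ranks - 1))
}

lowest = owned_counts(shared_nodes, num_ranks, lowest_rank)
rotated = owned_counts(shared_nodes, num_ranks, round_robin)
print("lowest rank:", lowest)
print("round robin:", rotated)
```

In this toy case lowest-rank ownership leaves the last rank with no shared nodes, while the rotating rule gives every rank at least some. Since owned nodes map to matrix rows, that is the kind of difference that would show up in row balance.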

@spdomin
Contributor Author

spdomin commented Jan 4, 2017

Latest push by @alanw0 provides the following differences:

First, the quantity of ghosting has gone down:

Old:

NonConformal alg will ghost a number of entities: 5285506

New:

NonConformal alg will ghost a new number of entities: 1242 and remove 12461 entities from ghosting.

Timing also improved (see push):

4cca5ba

@spdomin
Contributor Author

spdomin commented Feb 23, 2017

Transition to Jira.

@spdomin spdomin closed this as completed Feb 23, 2017

6 participants