
Adapt Field, AveragedField, and ComputedField for GPU, round 2 #1057

Merged
merged 25 commits from glw/adapt-field-round-2 into master on Oct 17, 2020

Conversation

@glwagner (Member) commented Oct 13, 2020

This PR adds new Adapt.adapt_structure methods for Field, AveragedField, and ComputedField:

  • Field and ComputedField are adapted to their data (thus shedding location information, the grid, and boundary conditions). This is fine because we don't reference location information or boundary conditions inside GPU kernels.

  • AveragedField sheds operand and grid when adapted to the GPU. AveragedField still needs location information for getindex to work correctly.
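As a sketch of the idea (simplified, not the exact code in this PR; `SimpleField` is a hypothetical stand-in for the real Field type), an Adapt.adapt_structure method that strips a field down to its data might look like:

```julia
using Adapt

# Hypothetical, simplified field type for illustration; the real
# Oceananigans Field also carries location type parameters.
struct SimpleField{A, G, B}
    data :: A
    grid :: G
    bcs  :: B
end

# When the field is adapted for a GPU kernel, keep only its data:
# the grid and boundary conditions are not referenced inside kernels,
# so they can be shed before kernel compilation.
Adapt.adapt_structure(to, field::SimpleField) = Adapt.adapt(to, field.data)
```

Because CUDA.jl calls Adapt.adapt on kernel arguments, defining adapt_structure this way is what lets fields be passed to kernels directly.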

This obviates the need for datatuple (we keep the function around, however, because it's useful for tests). It also obviates the need for gpufriendly.

We can now use AveragedField and ComputedField inside kernels. Some cases still don't work, however; we need to open an issue once this PR is merged.

This PR supersedes #746 .

Finally, we can dramatically simplify the time-stepping routine since we don't need to "unwrap" fields anymore.

It's probably worthwhile running a benchmark before merging but hopefully there's no issue.

Resolves #722 .

@codecov codecov bot commented Oct 14, 2020

Codecov Report

Merging #1057 into master will decrease coverage by 0.21%.
The diff coverage is 69.38%.


@@            Coverage Diff             @@
##           master    #1057      +/-   ##
==========================================
- Coverage   57.70%   57.49%   -0.22%     
==========================================
  Files         158      161       +3     
  Lines        3807     3807              
==========================================
- Hits         2197     2189       -8     
- Misses       1610     1618       +8     
Impacted Files Coverage Δ
src/AbstractOperations/AbstractOperations.jl 50.00% <ø> (ø)
src/AbstractOperations/show_abstract_operations.jl 13.04% <0.00%> (-0.60%) ⬇️
src/Buoyancy/buoyancy_field.jl 61.76% <0.00%> (-2.95%) ⬇️
src/Fields/abstract_field.jl 57.14% <0.00%> (-0.86%) ⬇️
src/Fields/averaged_field.jl 77.77% <ø> (+7.77%) ⬆️
src/Fields/computed_field.jl 64.28% <0.00%> (ø)
src/Fields/field.jl 82.35% <0.00%> (-5.89%) ⬇️
src/Fields/show_fields.jl 0.00% <0.00%> (ø)
src/Operators/laplacian_operators.jl 9.09% <ø> (ø)
src/TimeSteppers/TimeSteppers.jl 80.00% <ø> (ø)
... and 21 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5f47e0...c35af73.

@ali-ramadhan (Member) left a comment

Static ocean benchmarks (see below) show no performance regression on GPUs. In fact, it seems that CPU models are ~30% faster now 🎉

Side note: some potential performance regressions may not be caught by benchmark_static_ocean.jl. I still think we should merge this PR, since the static ocean benchmarks do test whether adapting Field introduces performance regressions.

I'm hoping to refactor the benchmarks to reduce boilerplate and produce more useful statistics/tables. As part of that I'll add a more comprehensive benchmark covering an LES closure, output writing, time averaging, etc.


Environment:

Oceananigans v0.42.0 (DEVELOPMENT BRANCH)
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)
  GPU: TITAN V

Static ocean benchmarks from master branch:

        Static ocean benchmarks                Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                 448s / 28.2%           31.2GiB / 0.40%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
  16× 16× 16  [CPU, Float32]       10   25.9ms  0.02%  2.59ms   3.40MiB  2.68%   348KiB
  16× 16× 16  [CPU, Float64]       10   32.1ms  0.03%  3.21ms   3.40MiB  2.68%   348KiB
  16× 16× 16  [GPU, Float32]       10   41.2ms  0.03%  4.12ms   9.28MiB  7.32%   950KiB
  16× 16× 16  [GPU, Float64]       10   45.3ms  0.04%  4.53ms   9.28MiB  7.32%   950KiB
  32× 32× 32  [CPU, Float32]       10    120ms  0.09%  12.0ms   3.40MiB  2.68%   348KiB
  32× 32× 32  [CPU, Float64]       10    117ms  0.09%  11.7ms   3.40MiB  2.68%   348KiB
  32× 32× 32  [GPU, Float32]       10   63.0ms  0.05%  6.30ms   9.28MiB  7.32%   950KiB
  32× 32× 32  [GPU, Float64]       10   41.1ms  0.03%  4.11ms   9.29MiB  7.32%   951KiB
  64× 64× 64  [CPU, Float32]       10    675ms  0.53%  67.5ms   3.40MiB  2.68%   348KiB
  64× 64× 64  [CPU, Float64]       10    705ms  0.56%  70.5ms   3.40MiB  2.68%   348KiB
  64× 64× 64  [GPU, Float32]       10   42.7ms  0.03%  4.27ms   9.28MiB  7.32%   950KiB
  64× 64× 64  [GPU, Float64]       10   43.7ms  0.03%  4.37ms   9.29MiB  7.32%   951KiB
 128×128×128  [CPU, Float32]       10    5.85s  4.64%   585ms   3.40MiB  2.68%   348KiB
 128×128×128  [CPU, Float64]       10    5.23s  4.14%   523ms   3.40MiB  2.68%   348KiB
 128×128×128  [GPU, Float32]       10   57.8ms  0.05%  5.78ms   9.28MiB  7.32%   951KiB
 128×128×128  [GPU, Float64]       10   53.5ms  0.04%  5.35ms   9.29MiB  7.32%   951KiB
 256×256×256  [CPU, Float32]       10    58.5s  46.4%   5.85s   3.40MiB  2.68%   348KiB
 256×256×256  [CPU, Float64]       10    53.9s  42.7%   5.39s   3.40MiB  2.68%   348KiB
 256×256×256  [GPU, Float32]       10    317ms  0.25%  31.7ms   9.32MiB  7.35%   955KiB
 256×256×256  [GPU, Float64]       10    321ms  0.25%  32.1ms   9.29MiB  7.32%   951KiB
 ──────────────────────────────────────────────────────────────────────────────────────

Static ocean benchmarks from glw/adapt-field-round-2 branch:

 ──────────────────────────────────────────────────────────────────────────────────────
        Static ocean benchmarks                Time                   Allocations      
                                       ──────────────────────   ───────────────────────
           Tot / % measured:                 369s / 25.7%           31.0GiB / 0.36%    

 Section                       ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
  16× 16× 16  [CPU, Float32]       10   24.4ms  0.03%  2.44ms   2.87MiB  2.49%   293KiB
  16× 16× 16  [CPU, Float64]       10   25.3ms  0.03%  2.53ms   2.87MiB  2.49%   293KiB
  16× 16× 16  [GPU, Float32]       10   40.3ms  0.04%  4.03ms   8.63MiB  7.50%   884KiB
  16× 16× 16  [GPU, Float64]       10   38.3ms  0.04%  3.83ms   8.63MiB  7.50%   884KiB
  32× 32× 32  [CPU, Float32]       10   74.6ms  0.08%  7.46ms   2.87MiB  2.49%   293KiB
  32× 32× 32  [CPU, Float64]       10   72.4ms  0.08%  7.24ms   2.87MiB  2.49%   293KiB
  32× 32× 32  [GPU, Float32]       10   63.5ms  0.07%  6.35ms   8.64MiB  7.50%   884KiB
  32× 32× 32  [GPU, Float64]       10   44.6ms  0.05%  4.46ms   8.64MiB  7.51%   885KiB
  64× 64× 64  [CPU, Float32]       10    527ms  0.56%  52.7ms   2.87MiB  2.49%   293KiB
  64× 64× 64  [CPU, Float64]       10    648ms  0.68%  64.8ms   2.87MiB  2.49%   293KiB
  64× 64× 64  [GPU, Float32]       10   40.5ms  0.04%  4.05ms   8.64MiB  7.50%   884KiB
  64× 64× 64  [GPU, Float64]       10   50.8ms  0.05%  5.08ms   8.64MiB  7.51%   885KiB
 128×128×128  [CPU, Float32]       10    4.86s  5.13%   486ms   2.87MiB  2.49%   293KiB
 128×128×128  [CPU, Float64]       10    3.93s  4.15%   393ms   2.87MiB  2.49%   293KiB
 128×128×128  [GPU, Float32]       10    128ms  0.13%  12.8ms   8.65MiB  7.52%   886KiB
 128×128×128  [GPU, Float64]       10   46.8ms  0.05%  4.68ms   8.64MiB  7.51%   885KiB
 256×256×256  [CPU, Float32]       10    43.0s  45.3%   4.30s   2.87MiB  2.49%   293KiB
 256×256×256  [CPU, Float64]       10    40.6s  42.8%   4.06s   2.87MiB  2.49%   293KiB
 256×256×256  [GPU, Float32]       10    317ms  0.33%  31.7ms   8.68MiB  7.54%   889KiB
 256×256×256  [GPU, Float64]       10    322ms  0.34%  32.2ms   8.65MiB  7.51%   885KiB
 ──────────────────────────────────────────────────────────────────────────────────────

* at (Cell, Cell, Cell) via identity
   ├── 0.3333333333333333
   │   └── / at (Cell, Cell, Cell) via Oceananigans.AbstractOperations.identity
   │   │   └── OffsetArrays.OffsetArray{Float64,3,Array{Float64,3}}
A Member commented:

Just an idea: We can pretty print rational numbers that show up in abstract operations

julia> rationalize(0.3333333333333333)
1//3

but perhaps this is misleading as Julia is actually multiplying by 0.3333333333333333 and not 1//3.

So probably the best thing to do is just print with eltype(model).

@glwagner (Member, Author) replied:

Hmm, we can also truncate floating point numbers to fewer significant digits by redefining tree_show(a::Number, depth, nesting):

tree_show(a::Union{Number, Function}, depth, nesting) = string(a)
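For instance (a sketch, assuming tree_show keeps its existing (a, depth, nesting) signature), rounding to a few significant digits could look like:

```julia
# Sketch: print floats in operation trees with only four significant
# digits, so 0.3333333333333333 shows as "0.3333". The method name and
# signature are assumed from the snippet above.
tree_show(a::AbstractFloat, depth, nesting) = string(round(a, sigdigits=4))
```

This keeps the printed tree compact without misrepresenting the stored value the way 1//3 would.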

@glwagner glwagner merged commit 12435ce into master Oct 17, 2020
@ali-ramadhan ali-ramadhan mentioned this pull request Oct 17, 2020
@navidcy navidcy deleted the glw/adapt-field-round-2 branch May 27, 2021 23:21
Successfully merging this pull request may close these issues:

Possible elegant solution for compiling kernels with fields as arguments