Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameterized abstractions that work on CPU and GPU architectures. #147

Merged
merged 66 commits into from
Apr 3, 2019

Conversation

ali-ramadhan
Copy link
Member

@ali-ramadhan ali-ramadhan commented Mar 21, 2019

This PR refactors many of the existing abstractions so that they can be passed to GPU kernels #59. It massively cleans up the operators and time-stepping loop, allowing us to keep using beautiful abstractions even in GPU kernels.

Note that after this PR is merged, CUDAnative#master and GPUifyLoops#vc/interactive will be required until the respective fixes are merged into an updated release. I will update the Project.toml when we can move to a new release.

Once this is merged I will also release v0.5.0.

When completed and merged this PR will:
Resolve #19
Resolve #59
Resolve #60
Resolve #66
Resolve #74
Resolve #133
Resolve #153

@ali-ramadhan ali-ramadhan added performance 🏍️ So we can get the wrong answer even faster cleanup 🧹 Paying off technical debt abstractions 🎨 Whatever that means labels Mar 21, 2019
@ali-ramadhan ali-ramadhan added this to the v0.5 milestone Mar 21, 2019
@ali-ramadhan ali-ramadhan self-assigned this Mar 21, 2019
@codecov
Copy link

codecov bot commented Mar 21, 2019

Codecov Report

Merging #147 into master will increase coverage by 3.89%.
The diff coverage is 80.6%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
+ Coverage   66.03%   69.92%   +3.89%     
==========================================
  Files          19       18       -1     
  Lines         630      645      +15     
==========================================
+ Hits          416      451      +35     
+ Misses        214      194      -20
Impacted Files Coverage Δ
src/forcing.jl 87.5% <ø> (ø) ⬆️
src/clock.jl 100% <100%> (ø) ⬆️
src/planetary_constants.jl 33.33% <100%> (ø) ⬆️
src/output_writers.jl 86.07% <100%> (+6.28%) ⬆️
src/model_configuration.jl 100% <100%> (ø) ⬆️
src/fieldsets.jl 100% <100%> (ø) ⬆️
src/equation_of_state.jl 100% <100%> (ø) ⬆️
src/grids.jl 94.59% <100%> (+1.49%) ⬆️
src/poisson_solvers.jl 34.78% <25%> (+1.44%) ⬆️
src/fields.jl 36.84% <48.57%> (-6.71%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbc8643...387aeda. Read the comment docs.

@ali-ramadhan ali-ramadhan marked this pull request as ready for review March 29, 2019 14:14
@ali-ramadhan
Copy link
Member Author

ali-ramadhan commented Mar 29, 2019

Marking this PR ready for review even though it is not ready so the GitLab CI pipeline will start checking this branch.

@ali-ramadhan ali-ramadhan changed the title Parameterized abstractions that work on CPU and GPU architectures. [WIP] Parameterized abstractions that work on CPU and GPU architectures. Mar 29, 2019
@ali-ramadhan
Copy link
Member Author

ali-ramadhan commented Apr 3, 2019

I could keep cleaning things up but I think I've done enough to close #59 (and 6 other issues!).

The only big feature missing is turning our Field struct into something we can adapt/convert to a CUDA device argument using cudaconvert or Adapt.jl. We can do more cleanup when we figure out how to do this.

Note that tests will fail on dev/nightly builds (Julia 1.2) because something changed which broke Cassette (which GPUifyLoops depends on).

Will release v0.5.0 once this is merged.

@ali-ramadhan ali-ramadhan changed the title [WIP] Parameterized abstractions that work on CPU and GPU architectures. Parameterized abstractions that work on CPU and GPU architectures. Apr 3, 2019
@ali-ramadhan
Copy link
Member Author

So I just ran the performance benchmarks again (different CPU but same GPU) and the model has gotten slower since the last time we ran benchmarks (PR #116), by about 30~40%.

Details pasted below but for the larger GPU models: 256^3 Float32 it went up from 85.9 ms to 119 ms, and for 128^3 Float32 it went up from 7.70 ms to 10.1 ms. These are wall clock time per model iteration.

To be fair, we had no boundary conditions API last time we benchmarked. I won't worry about this for now though as we haven't begun to optimize for performance.

──────────────────────────────────────────────────────────────────────────────────────────────────
             Oceananigans.jl benchmarks                    Time                   Allocations      
                                                   ──────────────────────   ───────────────────────
                 Tot / % measured:                       734s / 72.9%           18.7GiB / 0.06%    

 Section                                   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────────────
 256x256x256 static ocean (CPU, Float32)       10     312s  58.4%   31.2s   60.2KiB  0.47%  6.02KiB
 256x256x256 static ocean (CPU, Float64)       10     196s  36.7%   19.6s   78.0KiB  0.62%  7.80KiB
 128x128x128 static ocean (CPU, Float32)       10    14.2s  2.66%   1.42s   60.2KiB  0.47%  6.02KiB
 128x128x128 static ocean (CPU, Float64)       10    7.64s  1.43%   764ms   78.0KiB  0.62%  7.80KiB
 256x256x256 static ocean (GPU, Float64)       10    1.36s  0.25%   136ms   1.60MiB  12.9%   164KiB
 256x256x256 static ocean (GPU, Float32)       10    1.19s  0.22%   119ms   1.36MiB  11.0%   139KiB
  64x 64x 64 static ocean (CPU, Float32)       10    950ms  0.18%  95.0ms   60.2KiB  0.47%  6.02KiB
  64x 64x 64 static ocean (CPU, Float64)       10    584ms  0.11%  58.4ms   78.0KiB  0.62%  7.80KiB
 128x128x128 static ocean (GPU, Float64)       10    116ms  0.02%  11.6ms   1.60MiB  12.9%   164KiB
 128x128x128 static ocean (GPU, Float32)       10    101ms  0.02%  10.1ms   1.36MiB  11.0%   139KiB
  32x 32x 32 static ocean (CPU, Float32)       10   94.6ms  0.02%  9.46ms   60.2KiB  0.47%  6.02KiB
  32x 32x 32 static ocean (CPU, Float64)       10   53.9ms  0.01%  5.39ms   78.0KiB  0.62%  7.80KiB
  64x 64x 64 static ocean (GPU, Float64)       10   11.6ms  0.00%  1.16ms   1.60MiB  12.9%   164KiB
  64x 64x 64 static ocean (GPU, Float32)       10   11.2ms  0.00%  1.12ms   1.36MiB  11.0%   139KiB
  32x 32x 32 static ocean (GPU, Float32)       10   5.61ms  0.00%   561μs   1.36MiB  11.0%   139KiB
  32x 32x 32 static ocean (GPU, Float64)       10   5.18ms  0.00%   518μs   1.60MiB  12.9%   164KiB
 ──────────────────────────────────────────────────────────────────────────────────────────────────

CPU Float64 -> Float32 speedup:
 32x 32x 32 static ocean: 0.570
 64x 64x 64 static ocean: 0.615
128x128x128 static ocean: 0.538
256x256x256 static ocean: 0.628

GPU Float64 -> Float32 speedup:
 32x 32x 32 static ocean: 0.923
 64x 64x 64 static ocean: 1.038
128x128x128 static ocean: 1.147
256x256x256 static ocean: 1.138

CPU -> GPU speedup:
 32x 32x 32 static ocean (Float32): 16.867
 32x 32x 32 static ocean (Float64): 10.414
 64x 64x 64 static ocean (Float32): 84.867
 64x 64x 64 static ocean (Float64): 50.283
128x128x128 static ocean (Float32): 139.956
128x128x128 static ocean (Float64): 65.665
256x256x256 static ocean (Float32): 261.842
256x256x256 static ocean (Float64): 144.460

@ali-ramadhan ali-ramadhan merged commit 1594d9e into master Apr 3, 2019
@vchuravy
Copy link
Collaborator

vchuravy commented Apr 3, 2019

Note that tests will fail on dev/nightly builds (Julia 1.2) because something changed which broke Cassette (which GPUifyLoops depends on).

Pushed a fix for that this morning, and also tagged GPUifyLoops v0.2.1

@ali-ramadhan ali-ramadhan deleted the isbitstype-abstractions branch April 3, 2019 14:30
@ali-ramadhan
Copy link
Member Author

Awesome, will update packages.

@glwagner
Copy link
Member

glwagner commented Apr 4, 2019

Hmm... if the boundary conditions are default, the only thing that happens is a function is called which is something like f(args...) = nothing. That might get compiled away too (I was hoping so). So there's a bug in the boundary condition implementation if its causing a slowdown with default boundary conditions. 30-40% is too much slowdown for adding boundary conditions anyways I think.

@ali-ramadhan
Copy link
Member Author

ali-ramadhan commented Apr 4, 2019

That's true, there's been a ton of refactoring so it's probably more than just one thing that's changed (I can't say for sure that my changes didn't slow things down).

No point worrying too much until we get around to profiling (#162).

arcavaliere pushed a commit to arcavaliere/Oceananigans.jl that referenced this pull request Nov 6, 2019
…ions

Parameterized abstractions that work on CPU and GPU architectures.

Former-commit-id: 1594d9e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment