Parameterized abstractions that work on CPU and GPU architectures. #147

ali-ramadhan · 2019-03-21T20:48:45Z

This PR refactors many of the existing abstractions so that they can be passed to GPU kernels #59. It massively cleans up the operators and time-stepping loop, allowing us to keep using beautiful abstractions even in GPU kernels.

Note that after this PR is merged, CUDAnative#master and GPUifyLoops#vc/interactive will be required until the respective fixes are merged into an updated release. I will update the Project.toml when we can move to a new release.

Once this is merged I will also release v0.5.0.

When completed and merged this PR will:
Resolve #19
Resolve #59
Resolve #60
Resolve #66
Resolve #74
Resolve #133
Resolve #153

@Assert might get compiled out at certain optimization levels. It's really only for debugging.

codecov · 2019-03-21T23:45:44Z

Codecov Report

Merging #147 into master will increase coverage by 3.89%.
The diff coverage is 80.6%.

@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
+ Coverage   66.03%   69.92%   +3.89%     
==========================================
  Files          19       18       -1     
  Lines         630      645      +15     
==========================================
+ Hits          416      451      +35     
+ Misses        214      194      -20

Impacted Files	Coverage Δ
src/forcing.jl	`87.5% <ø> (ø)`	⬆️
src/clock.jl	`100% <100%> (ø)`	⬆️
src/planetary_constants.jl	`33.33% <100%> (ø)`	⬆️
src/output_writers.jl	`86.07% <100%> (+6.28%)`	⬆️
src/model_configuration.jl	`100% <100%> (ø)`	⬆️
src/fieldsets.jl	`100% <100%> (ø)`	⬆️
src/equation_of_state.jl	`100% <100%> (ø)`	⬆️
src/grids.jl	`94.59% <100%> (+1.49%)`	⬆️
src/poisson_solvers.jl	`34.78% <25%> (+1.44%)`	⬆️
src/fields.jl	`36.84% <48.57%> (-6.71%)`	⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbc8643...387aeda. Read the comment docs.

This removes the dependence on ::ModelMetadata and moves closer to an isbitstype Field abstraction. But will have to do a lot of refactoring now.

Skipping the field operations for now, we don't even use them...

Also getting rid of redundant test_ prefixes for all test functions.

ali-ramadhan · 2019-03-29T14:15:47Z

Marking this PR ready for review even though it is not ready so the GitLab CI pipeline will start checking this branch.

Nuked the two separate time stepping kernels.

…machine/Oceananigans.jl into isbitstype-abstractions

ali-ramadhan · 2019-04-03T02:29:00Z

I could keep cleaning things up but I think I've done enough to close #59 (and 6 other issues!).

The only big feature missing is turning our Field struct into something we can adapt/convert to a CUDA device argument using cudaconvert or Adapt.jl. We can do more cleanup when we figure out how to do this.

Note that tests will fail on dev/nightly builds (Julia 1.2) because something changed which broke Cassette (which GPUifyLoops depends on).

Will release v0.5.0 once this is merged.

ali-ramadhan · 2019-04-03T12:49:06Z

So I just ran the performance benchmarks again (different CPU but same GPU) and the model has gotten slower since the last time we ran benchmarks (PR #116), by about 30~40%.

Details pasted below but for the larger GPU models: 256^3 Float32 it went up from 85.9 ms to 119 ms, and for 128^3 Float32 it went up from 7.70 ms to 10.1 ms. These are wall clock time per model iteration.

To be fair, we had no boundary conditions API last time we benchmarked. I won't worry about this for now though as we haven't begun to optimize for performance.

──────────────────────────────────────────────────────────────────────────────────────────────────
             Oceananigans.jl benchmarks                    Time                   Allocations      
                                                   ──────────────────────   ───────────────────────
                 Tot / % measured:                       734s / 72.9%           18.7GiB / 0.06%    

 Section                                   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────────────────
 256x256x256 static ocean (CPU, Float32)       10     312s  58.4%   31.2s   60.2KiB  0.47%  6.02KiB
 256x256x256 static ocean (CPU, Float64)       10     196s  36.7%   19.6s   78.0KiB  0.62%  7.80KiB
 128x128x128 static ocean (CPU, Float32)       10    14.2s  2.66%   1.42s   60.2KiB  0.47%  6.02KiB
 128x128x128 static ocean (CPU, Float64)       10    7.64s  1.43%   764ms   78.0KiB  0.62%  7.80KiB
 256x256x256 static ocean (GPU, Float64)       10    1.36s  0.25%   136ms   1.60MiB  12.9%   164KiB
 256x256x256 static ocean (GPU, Float32)       10    1.19s  0.22%   119ms   1.36MiB  11.0%   139KiB
  64x 64x 64 static ocean (CPU, Float32)       10    950ms  0.18%  95.0ms   60.2KiB  0.47%  6.02KiB
  64x 64x 64 static ocean (CPU, Float64)       10    584ms  0.11%  58.4ms   78.0KiB  0.62%  7.80KiB
 128x128x128 static ocean (GPU, Float64)       10    116ms  0.02%  11.6ms   1.60MiB  12.9%   164KiB
 128x128x128 static ocean (GPU, Float32)       10    101ms  0.02%  10.1ms   1.36MiB  11.0%   139KiB
  32x 32x 32 static ocean (CPU, Float32)       10   94.6ms  0.02%  9.46ms   60.2KiB  0.47%  6.02KiB
  32x 32x 32 static ocean (CPU, Float64)       10   53.9ms  0.01%  5.39ms   78.0KiB  0.62%  7.80KiB
  64x 64x 64 static ocean (GPU, Float64)       10   11.6ms  0.00%  1.16ms   1.60MiB  12.9%   164KiB
  64x 64x 64 static ocean (GPU, Float32)       10   11.2ms  0.00%  1.12ms   1.36MiB  11.0%   139KiB
  32x 32x 32 static ocean (GPU, Float32)       10   5.61ms  0.00%   561μs   1.36MiB  11.0%   139KiB
  32x 32x 32 static ocean (GPU, Float64)       10   5.18ms  0.00%   518μs   1.60MiB  12.9%   164KiB
 ──────────────────────────────────────────────────────────────────────────────────────────────────

CPU Float64 -> Float32 speedup:
 32x 32x 32 static ocean: 0.570
 64x 64x 64 static ocean: 0.615
128x128x128 static ocean: 0.538
256x256x256 static ocean: 0.628

GPU Float64 -> Float32 speedup:
 32x 32x 32 static ocean: 0.923
 64x 64x 64 static ocean: 1.038
128x128x128 static ocean: 1.147
256x256x256 static ocean: 1.138

CPU -> GPU speedup:
 32x 32x 32 static ocean (Float32): 16.867
 32x 32x 32 static ocean (Float64): 10.414
 64x 64x 64 static ocean (Float32): 84.867
 64x 64x 64 static ocean (Float64): 50.283
128x128x128 static ocean (Float32): 139.956
128x128x128 static ocean (Float64): 65.665
256x256x256 static ocean (Float32): 261.842
256x256x256 static ocean (Float64): 144.460

vchuravy · 2019-04-03T14:27:06Z

Note that tests will fail on dev/nightly builds (Julia 1.2) because something changed which broke Cassette (which GPUifyLoops depends on).

Pushed a fix for that this morning, and also tagged GPUifyLoops v0.2.1

ali-ramadhan · 2019-04-03T14:30:34Z

Awesome, will update packages.

glwagner · 2019-04-04T09:33:11Z

Hmm... if the boundary conditions are default, the only thing that happens is a function is called which is something like f(args...) = nothing. That might get compiled away too (I was hoping so). So there's a bug in the boundary condition implementation if its causing a slowdown with default boundary conditions. 30-40% is too much slowdown for adding boundary conditions anyways I think.

ali-ramadhan · 2019-04-04T10:51:06Z

That's true, there's been a ton of refactoring so it's probably more than just one thing that's changed (I can't say for sure that my changes didn't slow things down).

No point worrying too much until we get around to profiling (#162).

…ions Parameterized abstractions that work on CPU and GPU architectures. Former-commit-id: 1594d9e

ali-ramadhan added 10 commits March 21, 2019 15:04

Proper grid tests for Float32.

6da23ca

Use exceptions instead of @Assert.

59eede5

@Assert might get compiled out at certain optimization levels. It's really only for debugging.

More grid constructor checks and tests.

c03f2ad

More strongly typed RegularCartesianGrid.

c8120ba

Architecture type for dispatching on model arch.

391a62a

Fix: commented out tests by mistake...

0ca0207

Update RegularCartesianGrid docstrings.

003b180

Added Architecture parameter to model.

5b54200

Remove leftover debug messages.

9030787

Get model tests running again.

1c797da

ali-ramadhan added performance 🏍️ So we can get the wrong answer even faster cleanup 🧹 Paying off technical debt abstractions 🎨 Whatever that means labels Mar 21, 2019

ali-ramadhan added this to the v0.5 milestone Mar 21, 2019

ali-ramadhan self-assigned this Mar 21, 2019

ali-ramadhan mentioned this pull request Mar 25, 2019

isbits CPU/GPU abstractions. CliMA/ClimateMachine.jl#130

Open

ali-ramadhan added 13 commits March 25, 2019 19:20

Properly export @hascuda macro.

09cf72c

Refactor Field structs.

e09806c

This removes the dependence on ::ModelMetadata and moves closer to an isbitstype Field abstraction. But will have to do a lot of refactoring now.

Refactored Field constructors.

30dbace

Remove redundant package import statement.

8e0eeff

Exporting Architectures, might be needed.

6a3c6ff

Refactoring grid initialization tests.

9c45d9f

Refactoring grid dimension tests.

1aa5c50

Refactored Field tests.

2fd57a8

Skipping the field operations for now, we don't even use them...

Some simple print statements to show test progress.

f1ca72d

Commas after spaces!

1546f0a

Renaming tests with Boolean sounding names.

9e107bb

Also getting rid of redundant test_ prefixes for all test functions.

Refactored Poisson solver tests.

baaf290

More Poisson solver tests for weird grid sizes.

c2386d3

ali-ramadhan marked this pull request as ready for review March 29, 2019 14:14

ali-ramadhan changed the title ~~Parameterized abstractions that work on CPU and GPU architectures.~~ [WIP] Parameterized abstractions that work on CPU and GPU architectures. Mar 29, 2019

ali-ramadhan added 4 commits March 29, 2019 11:02

Merge branch 'master' into isbitstype-abstractions

9f5dbb7

Some nicer operators!

92cfac8

More nice operators that use the Grid.

5242a05

Time stepping is now truly architecture agnostic!

85090e9

Nuked the two separate time stepping kernels.

ali-ramadhan mentioned this pull request Apr 2, 2019

How to reduce compile time for GPU code? #66

Closed

ali-ramadhan added 4 commits April 2, 2019 13:09

Updating CUDAnative and GPUifyLoops

566581c

Neater time stepping.

4512c88

Neater forcing function signature.

0c48884

Merge branch 'isbitstype-abstractions' of https://github.com/climate-…

4d6e736

…machine/Oceananigans.jl into isbitstype-abstractions

ali-ramadhan mentioned this pull request Apr 2, 2019

Change AB-2 parameter to 0.5 + χ to offer first order forward time stepping #160

Closed

ali-ramadhan added 3 commits April 3, 2019 01:14

Fixed a couple of kernels

378b912

Nuking old operators

33fded2

Nicer calls to boundary condition stuff

387aeda

ali-ramadhan requested a review from glwagner April 3, 2019 02:22

ali-ramadhan changed the title ~~[WIP] Parameterized abstractions that work on CPU and GPU architectures.~~ Parameterized abstractions that work on CPU and GPU architectures. Apr 3, 2019

ali-ramadhan mentioned this pull request Apr 3, 2019

More general CUDA thread block layout. #64

Closed

ali-ramadhan merged commit 1594d9e into master Apr 3, 2019

ali-ramadhan mentioned this pull request Apr 3, 2019

Profiling Oceananigans #162

Closed

ali-ramadhan deleted the isbitstype-abstractions branch April 3, 2019 14:30

ali-ramadhan mentioned this pull request Apr 9, 2019

Buoyancy update in linear time. #172

Merged

arcavaliere pushed a commit to arcavaliere/Oceananigans.jl that referenced this pull request Nov 6, 2019

Merge pull request CliMA#147 from climate-machine/isbitstype-abstract…

66627a8

…ions Parameterized abstractions that work on CPU and GPU architectures. Former-commit-id: 1594d9e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parameterized abstractions that work on CPU and GPU architectures. #147

Parameterized abstractions that work on CPU and GPU architectures. #147

ali-ramadhan commented Mar 21, 2019 •

edited

Loading

codecov bot commented Mar 21, 2019 •

edited

Loading

ali-ramadhan commented Mar 29, 2019 •

edited

Loading

ali-ramadhan commented Apr 3, 2019 •

edited

Loading

ali-ramadhan commented Apr 3, 2019

vchuravy commented Apr 3, 2019

ali-ramadhan commented Apr 3, 2019

glwagner commented Apr 4, 2019

ali-ramadhan commented Apr 4, 2019 •

edited

Loading

Parameterized abstractions that work on CPU and GPU architectures. #147

Parameterized abstractions that work on CPU and GPU architectures. #147

Conversation

ali-ramadhan commented Mar 21, 2019 • edited Loading

codecov bot commented Mar 21, 2019 • edited Loading

Codecov Report

ali-ramadhan commented Mar 29, 2019 • edited Loading

ali-ramadhan commented Apr 3, 2019 • edited Loading

ali-ramadhan commented Apr 3, 2019

vchuravy commented Apr 3, 2019

ali-ramadhan commented Apr 3, 2019

glwagner commented Apr 4, 2019

ali-ramadhan commented Apr 4, 2019 • edited Loading

ali-ramadhan commented Mar 21, 2019 •

edited

Loading

codecov bot commented Mar 21, 2019 •

edited

Loading

ali-ramadhan commented Mar 29, 2019 •

edited

Loading

ali-ramadhan commented Apr 3, 2019 •

edited

Loading

ali-ramadhan commented Apr 4, 2019 •

edited

Loading