Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes writing of 3d-reduced Fields to NetCDF #2865

Merged
merged 13 commits into from
Feb 8, 2023
Merged

Conversation

tomchor
Copy link
Collaborator

@tomchor tomchor commented Jan 14, 2023

This fixes the error we were getting when writing Fields reduced over 3 dimensions to disk with NetCDFOutputWriter according to the upstream provided in a PR at NCDatasets: Alexander-Barth/NCDatasets.jl#197.

This PR also adds a test to catch this in the future.

For now this is only working on the master branch of NCDatasets so tests should fail for now, but once a new version of NCDatasets is released I'll update the packages.

Closes #2857

@tomchor tomchor marked this pull request as draft January 14, 2023 02:19
@glwagner
Copy link
Member

glwagner commented Jan 14, 2023

Can you use a descriptive link to the PR / issue in NCDatasets (rather than just "here")?

@glwagner
Copy link
Member

glwagner commented Jan 14, 2023

The key issue is Alexander-Barth/NCDatasets.jl#197

@tomchor
Copy link
Collaborator Author

tomchor commented Jan 31, 2023

We're getting errors when running the tests on buildkite that I'm not getting when running on a GPU locally.

For example this:

Field boundary conditions [GPU, RectilinearGrid]: Test Failed at /net/ocean/home/data44/data5/glwagner/.buildkite-agent/builds/sverdrup-7/clima/oceananigans/test/test_computed_field.jl:475
--
  | Expression: #= /net/ocean/home/data44/data5/glwagner/.buildkite-agent/builds/sverdrup-7/clima/oceananigans/test/test_computed_field.jl:475 =# CUDA.@allowscalar all(ST.data[1:Nx, 1:Ny, 0] .== ST.data[1:Nx, 1:Ny, 1])

The above line works for me on GPU when doing include("test_computed_field.jl"), but for some reason fails on buildkite.

@glwagner @simone-silvestri any ideas as to why? The fact that I can't reproduce these errors locally is making it hard for me to solve them

@. u.data = 1 + rand()
@. v.data = 2 + rand()
@. w.data = 3 + rand()
CUDA.@allowscalar begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to do this without @allowscalar? We should be reducing where allowscalar appears in tests, not adding new tests with this.

@@ -301,7 +303,7 @@ function computations_with_averaged_field_derivative(model)

set!(model, T = (x, y, z) -> 3 * z)

return all(interior(shear)[2:3, 2:3, 2:3] .== interior(T)[2:3, 2:3, 2:3])
return CUDA.@allowscalar all(interior(shear)[2:3, 2:3, 2:3] .== interior(T)[2:3, 2:3, 2:3])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. We should minimize usage of @allowscalar in tests. This is one of the largest sources of technical debt in our tests and has incurred a lot of pain in the past

@@ -320,7 +322,7 @@ function computations_with_computed_fields(model)
tke = Field(tke_op)
compute!(tke)

return all(interior(tke)[2:3, 2:3, 2:3] .== 9/2)
return CUDA.@allowscalar all(interior(tke)[2:3, 2:3, 2:3] .== 9/2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@glwagner
Copy link
Member

glwagner commented Feb 1, 2023

There are a lot of new instances of @allowscalar, but rather than adding new instances we should be refactoring the tests so they don't appear.

When we find that we have to use @allowscalar, it often indicates that our Field infrastructure is somehow deficient / doesn't support necessary operations, which causes us to resort to indexing and other syntax that requires @allowscalar.

@navidcy @simone-silvestri

@tomchor
Copy link
Collaborator Author

tomchor commented Feb 1, 2023

There are a lot of new instances of @allowscalar, but rather than adding new instances we should be refactoring the tests so they don't appear.

I added these because it was the only way to make tests pass locally. However, I can't fully reproduce tests results locally anyway, like I mentioned in my previous comment, so these may well be unnecessary (since these lines might be passing on buildkite).

@@ -320,7 +322,7 @@ function computations_with_computed_fields(model)
tke = Field(tke_op)
compute!(tke)

return all(interior(tke)[2:3, 2:3, 2:3] .== 9/2)
return CUDA.@allowscalar all(interior(tke)[2:3, 2:3, 2:3] .== 9/2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return CUDA.@allowscalar all(interior(tke)[2:3, 2:3, 2:3] .== 9/2)
return all(interior(tke, 2:3, 2:3, 2:3) .== 9/2)

@glwagner
Copy link
Member

glwagner commented Feb 1, 2023

I suggested a syntax change that could help

@tomchor
Copy link
Collaborator Author

tomchor commented Feb 1, 2023

I suggested a syntax change that could help

You're right it does help. I'll try replacing these one by one and see if that helps with the error on buildkite. Although if would be useful to figure out why I'm not getting the same errors locally.

@glwagner
Copy link
Member

glwagner commented Feb 1, 2023

Awesome!

@tomchor
Copy link
Collaborator Author

tomchor commented Feb 2, 2023

Apparently the new syntax does help avoid @allowscalar instances, and things do compile locally for me, but the errors on buildkite are still there:

Computations with Averaged Fields [GPU, RectilinearGrid]: Test Failed at /net/ocean/home/data44/data5/glwagner/.buildkite-agent/builds/sverdrup-13/clima/oceananigans/test/test_computed_field.jl:583
--
  | Expression: all(interior(tke_yz) .== 9 / 2)

Any ideas on what might be the cause of the differences between builkite and my local server? If someone could also run one of the failing tests on a GPU locally and see if they get the same errors that buildkite is throwing, that would be helpful.

@tomchor tomchor requested a review from glwagner February 5, 2023 15:40
@tomchor
Copy link
Collaborator Author

tomchor commented Feb 5, 2023

Finally got the tests passing! It was something having to do with GPUCompiler.jl.

This is ready to merge/review.

@tomchor
Copy link
Collaborator Author

tomchor commented Feb 8, 2023

@navidcy @simone-silvestri @glwagner with #2899 being merged, this bugfix PR is pretty trivial. Can I get a review whenever any of you have the time?

@tomchor tomchor merged commit 4940d29 into main Feb 8, 2023
@tomchor tomchor deleted the tc/reduced-field-netcdf branch February 8, 2023 02:14
set!(model, c=1)

Δt = 1/64 # Nice floating-point number
simulation = Simulation(model, Δt=Δt, stop_time=50Δt)
Copy link
Member

@glwagner glwagner Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do fewer steps? Our tests currently stretch the limits of our resources so we need to be as parsimonious as possible when adding new tests. Note also that compilation cost is the main thing. If this test can be combined with another test, that'd be ideal. For example, many different NetCDF tests could use the same simulation --- there's no need to run independent simulations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also suggest using stop_iteration

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @glwagner I ended up merging before you commented. I copied the test template for others in the same file so there are more tests that we apply this change to. Would you like me to open another PR for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'm pointing out that we don't want to copy/paste test code now without care, since a lot of our test code is poorly written / wasteful and our CI is straining under the pressure... :-/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any PRs that reduce test cost will be greatly appreciated! I can't tell if all the changes will make a big difference, you are better placed to analyze that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After running these a few time for this PR, I'd say that this one won't make much of a difference. But I have identified several tests that instantiate its own model each that could be merged together. That, I think, will have a more significant impact. I'll open a PR about it once some of my other PRs are merged!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when writing Fields reduced over 3 dimensions to NetCDF
4 participants