Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Robust floating point comparisons #495

Merged
merged 30 commits into from
Jul 2, 2024
Merged

Conversation

ChrisZYJ
Copy link
Contributor

Description

This PR completes #482 by replacing all == dflt_real and /= dflt_real with f_is_default().

It creates a new module named m_helper_basic to prevent circular dependency (the function is used within m_global_parameters. Please let me know if this is a good approach or can be changed.

Type of change

  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

How Has This Been Tested?

  • All tests passed on CPUs locally
  • All tests passed on GPUs on Delta

Test Configuration:

  • What computers and compilers did you use to test this:
    • CPU: Ubuntu 22.04.4 LTS; GNU v11.4.0
    • GPU: Delta

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/) (not applicable)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics" (not applicable)
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled (not for AMD hardware)
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensure the new features performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature via nvtx ranges so that they can be identified in profiles (not applicable)
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
    nsys.txt
  • Ran an Omniperf profile using ./mfc.sh run XXXX --gpu -t simulation --omniperf, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature
    performance

@sbryngelson
Copy link
Member

This is really nice @ChrisZYJ

@sbryngelson
Copy link
Member

sbryngelson commented Jun 29, 2024

This PR fails benchmarking because the benchmark case (viscous sgb) actually NaNs. This issue appears the same as the one documented in #396. I suspect this is showing up now due to a change in the compilers or some other environment variable on Phoenix. It exits with a NaN when attempting to write the first output file (which I suspect is when the code checks for NaNs). This benchmark case might just be broken.

@ChrisZYJ you can just run this case on Delta outside of CI and watch it fail. We should be able to fix it.

I do find it odd that when looking at the logs, only the PR NaNed. It's a strange issue.

@ChrisZYJ
Copy link
Contributor Author

@sbryngelson Really sorry for the mistake in m_monopole which escaped all the tests. Hopefully this fixes the problem

@sbryngelson
Copy link
Member

Thanks @ChrisZYJ! No apology is needed; this just means we need better tests (we always do, but having codecov definitely helps). Can you add a few tests that would check monopole "better"?

Also looks like tests are still failing... maybe my original hypothesis was correct as well?

@ChrisZYJ
Copy link
Contributor Author

ChrisZYJ commented Jun 29, 2024

Yeah you're definitely right. I just tried that benchmark on Delta and saw the same problem as #396. I've double checked this PR again and everything should be right, so I'm really not sure what's going on. On a side note, numerous tests fail when I run ./mfc.sh test -g 4 -- '-cdelta' with GPU, starting from 301B9153 (all tests are passed without the -g 4 option. I've also tested on v4.9.1 and same thing happens. This may or may not be related.

I'll definitely add more tests for monopole, probably in another PR. I'm actually working on more monopole options, so maybe they can come together.

@sbryngelson
Copy link
Member

sbryngelson commented Jun 29, 2024

Ok, I would like to hold off merging PRs until benchmarking is running. This issue you witness is not Delta-specific (I don't think). If you check the issue, I tried different compiler versions and computers and GPUs. I think it's some environment thing. But we should be able to track it down. For example, do the tests actually NaN or not? If not, why are they failing?

I'm ok with holding off on more monopole tests until another PR.

@ChrisZYJ
Copy link
Contributor Author

Thanks for the clarifications! If there are any specific ways I can assist in resolving the problem, please let me know. I'll also add the monopole tests after benchmarking is working

@sbryngelson
Copy link
Member

Thanks for the clarifications! If there are any specific ways I can assist in resolving the problem, please let me know. I'll also add the monopole tests after benchmarking is working

This issue has been open for a couple of months so feel free to take a crack at it.

@sbryngelson
Copy link
Member

@ChrisZYJ A PR after yours is passing the GPU benchmark (#496). I think you have a bug in your code that is causing problems on some GPU cases on Frontier and Phoenix (V100 machine).

Still, issue #396 needs to be resolved, though it appears separate.

@ChrisZYJ
Copy link
Contributor Author

ChrisZYJ commented Jul 1, 2024

@sbryngelson I fixed the benchmark problem in #497. I'm merging it into this PR and this PR should then pass the GPU benchmark.

@sbryngelson
Copy link
Member

@ChrisZYJ nice - I commented on the other PR for a minor change. Once they pass tests and that case is fixed I'll merge them both.

@sbryngelson sbryngelson merged commit 9f6b81f into MFlowCode:master Jul 2, 2024
18 of 19 checks passed
AiredaleDev pushed a commit to AiredaleDev/MFC that referenced this pull request Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants