Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pooling] Fix FP16 precision issues (averaging mode) plus more #2118

Merged
merged 17 commits into from
May 3, 2023

Conversation

atamazov
Copy link
Contributor

@atamazov atamazov commented Apr 28, 2023

The main goal if resolving issue #2109. This is done by switching to mixed precision for FP16 average pooling. Other changes include adjustments of verification tolerance for FP32 inference and for FP16 training (to avoid false failures) and observing available amount of scratch memory in order to avoid building of kernels that would fail anyway. Plus lots of refactoring. Details:

  • [pooling kernels]
    • [Forward] 🔴 Fix: Engage mixed-precision in FP16 average pooling
    • [Backward] 🟡 Fix: Build only necessary kernels. This allows POOLING_OP_AVE to handle larger configs.
  • [pooling solver] 🔴 Fix: Make PoolingForward2d not applicable when scratch limit is exceeded.
  • [tests][pooling] 🟡 Added regression tests for issue [pooling] FP16 precision issue (averaging mode) #2109
  • [driver][pooling]
    • [Backward][FP16] 🟡 Fix: Raised verification tolerance to avoid false failures (1e-6 -> 5e-3)
    • [Forward][FP32] 🟡 Fix: Raised verification tolerance to avoid false failures (1e-6 -> 1e-5).
    • [NFC][Debugging]
      • Improved verification and printing of its results.
      • Added different initialization modes on input for Forward FP16 pooling.
      • Added emulation of validation errors for Forward pooling.
  • Other by-products that may be useful in all solvers/kernels
    • 🟢 Added MaxWaveScratchSize to TargetProperties
    • [NFC][float_types.h] This can be used in any kernel:
      • Support for for building both mixed-precision and "pure" kernels from the single source.
      • Support for uniform initialization of BF16 variables.
      • Both features implemented in 643a8d6

ℹ️ Note for reviewers

All significant commits (which change the behavior of the library OR may be useful) are listed above. I recommend looking at them first. Other commits are just refactorings.


[Attribution] @junliume @johnny-keker

…s nominal for all ops except FP16 averaging. Some refactoring of macros.
…mixed-precision and "pure" kernels from the single source. Support uniform initialization of BF16 variables.
… collecting validation stats. Improve message printed when validation fails. Support emulation of validation errors.
…initialization modes on input for FP16 (for debugging). Improve message about verification result (add max error and num_flops).
…sary kernels. This allows POOLING_OP_AVE to handle larger configs.
… and printing of its results. Raised verification tolerance for FP16. Quality fixes. Refactoring.
…or: Compute some kernel parameters in one place (to reuse in IsApplicable)
… not applicable when scratch limit is exceeded.
@junliume
Copy link
Collaborator

junliume commented May 1, 2023

@atamazov OCL tidy has failed at:

[2023-04-30T05:46:34.808Z] /var/jenkins/workspace/MLLibs_MIOpen_PR-2118/src/include/miopen/pooling/solvers.hpp:49:10: warning: function 'miopen::solver::pooling::PoolingForward2d::IsApplicable' has a definition with different parameter names [readability-inconsistent-declaration-parameter-name]

[2023-04-30T05:46:34.808Z]     bool IsApplicable(const ExecutionContext& context,

[2023-04-30T05:46:34.808Z]          ^

[2023-04-30T05:46:34.808Z] /var/jenkins/workspace/MLLibs_MIOpen_PR-2118/src/solver/pooling/forward2d.cpp:134:24: note: the definition seen here

[2023-04-30T05:46:34.808Z] bool PoolingForward2d::IsApplicable(const ExecutionContext& ec,

[2023-04-30T05:46:34.808Z]                        ^

[2023-04-30T05:46:34.808Z] /var/jenkins/workspace/MLLibs_MIOpen_PR-2118/src/include/miopen/pooling/solvers.hpp:49:10: note: differing parameters are named here: ('context'), in definition: ('ec')

[2023-04-30T05:46:34.808Z]     bool IsApplicable(const ExecutionContext& context,

[2023-04-30T05:46:34.808Z]          ^                                    ~~~~~~~

[2023-04-30T05:46:34.808Z]                                               ec

OCL_tidy_log.log

@atamazov
Copy link
Contributor Author

atamazov commented May 1, 2023

@junliume This should be fixed in dd522f0

// Compute amount of private memory required for holding the arrays defined
// in the "mloPoolingG" kernel:
//
// #define MLO_BOT_DATA_SZ0 \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atamazov : there is another tidy issue

[2023-05-01T16:28:47.204Z] /home/jenkins/workspace/MLLibs_MIOpen_PR-2118/src/solver/pooling/forward2d.cpp:95:1: error: multi-line comment [-Werror=comment]

[2023-05-01T16:28:47.204Z]    95 | // #define MLO_BOT_DATA_SZ0 \

[2023-05-01T16:28:47.204Z]       | ^

[2023-05-01T16:28:47.204Z] /home/jenkins/workspace/MLLibs_MIOpen_PR-2118/src/solver/pooling/forward2d.cpp:97:1: error: multi-line comment [-Werror=comment]

[2023-05-01T16:28:47.204Z]    97 | // #define MLO_BOT_DATA_SZ1 \

[2023-05-01T16:28:47.204Z]       | ^

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another issue:

cc1plus: error: unrecognized command line option ‘-Wno-option-ignored’ [-Werror]

Copy link
Contributor Author

@atamazov atamazov May 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junliume This unrecognized command line option ‘-Wno-option-ignored’ [-Werror] is from Jenkinsfile or from ./cmake/EnableCompilerWarnings.cmake, this PR has nothing to do with it.

Multi-line comment will be fixed soon.

@atamazov
Copy link
Contributor Author

atamazov commented May 1, 2023

@junliume

This unrecognized command line option ‘-Wno-option-ignored’ [-Werror] is from Jenkinsfile or from ./cmake/EnableCompilerWarnings.cmake, this PR has nothing to do with it.

Merged from fresh develop. Hope this helps.

Multi-line comment will be fixed soon.

46cc255

Ready for the new CI testing round, thanks.

atamazov added a commit to atamazov/MIOpen that referenced this pull request May 24, 2023
…ctness for FP16 (issue ROCm#2160).

[tests] Added regression test for issue ROCm#2160.
[tests] Added regression test for issue ROCm#2110 (1).
[tests] Fixed issues in test_miopendriver_regression_float and test_miopendriver_regression_half introduced in ROCm#2118.
[tests] Introduced WORKAROUND_ISSUE_2110_2.
bghimireamd pushed a commit that referenced this pull request Jun 1, 2023
* pooling-fp16-use-fp32-acc(01) [NFC][pooling kernels] Make ACCUM types available. Some refactoring of macros.

* pooling-fp16-use-fp32-acc(02) [NFC][pooling kernels] Make ACCUM macros nominal for all ops except FP16 averaging. Some refactoring of macros.

* pooling-fp16-use-fp32-acc(03) [NFC][kernels] Allow for building both mixed-precision and "pure" kernels from the single source. Support uniform initialization of BF16 variables.

* pooling-fp16-use-fp32-acc(04) [pooling kernels] Engage mixed-precision in FP16 average pooling

* pooling-fp16-use-fp32-acc(05) [driver][pooling][inference] Support or collecting validation stats. Improve message printed when validation fails. Support emulation of validation errors.

* pooling-fp16-use-fp32-acc(06) [driver][pooling][inference] Different initialization modes on input for FP16 (for debugging). Improve message about verification result (add max error and num_flops).

* pooling-fp16-use-fp32-acc(07) [NFC][pooling kernels] Build only necessary kernels. This allows POOLING_OP_AVE to handle larger configs.

* pooling-fp16-use-fp32-acc(08) [driver][pooling] Improved verification and printing of its results. Raised verification tolerance for FP16. Quality fixes. Refactoring.

* pooling-fp16-use-fp32-acc(09) [driver][pooling][inference] Raised verification tolerance for FP32.

* pooling-fp16-use-fp32-acc(10) [NFC][pooling] PoolingForward2d: Refactor: Compute some kernel parameters in one place (to reuse in IsApplicable)

* pooling-fp16-use-fp32-acc(11) [NFC][pooling] PoolingForward2d: Improve previous commit

* pooling-fp16-use-fp32-acc(12) Added MaxWaveScratchSize to TargetProperties

* pooling-fp16-use-fp32-acc(13) [pooling] PoolingForward2d: Make solver not applicable when scratch limit is exceeded.

* pooling-fp16-use-fp32-acc(14) [tests][pooling] Added regression tests for issue #2109
junliume pushed a commit that referenced this pull request Jun 15, 2023
* pooling-naive(01) [NFC] Added MaxLocalMemorySize to target properties

* pooling-naive(02) [NFC][pooling][backward] Refactor to compute some kernel params only once when LDS size check is implemented.

* pooling-naive(03) [pooling][backward2d] Implement LDS (local memory) size check

* pooling-naive(04) [pooling][forward_2d] Small fixes: alignment of private arrays and logging.

* pooling-naive(05) [NFC] Reformat comments.

* pooling-naive(06) [NFC][driver] Added const to argument of GetTensor... functions.

* pooling-naive(07) [pooling][forward] Add naive solver with temporary host-side-only implementation.

* pooling-naive(08) [pooling] FIX: Remove implementation specific details from NetworkConfig (a key of InvokerCache). These details lead correctness issues with other implementations due to usage of wrong invokers.

* pooling-naive(09) [pooling][forward] Naive solver: move computations from kernel to solver.

* pooling-naive(10) [pooling][forward] Naive solver: reorg host-only code to better match device execution model.

* pooling-naive(11) [tests][pooling] Enable index verification in Full tests (--all).

* pooling-naive(12) [pooling][forward] Naive solver: Mark it dynamic.

* pooling-naive(13) [pooling][forward] Naive host-only solver: Tidy fixes. Developed and applied some strategy to prevent overflows. Unified names of kernel args. Partially implemented host code for running the kernel on GPU.

* pooling-naive(14) [pooling][forward] Naive solver: Added OCL kernel.

* pooling-naive(15) [pooling][forward] Naive solver: Convert pooling op from argument to compile-time parameter.

* pooling-naive(16) [pooling][forward] Naive solver: Fixed kernel build errors and warnings. Finished code that enqueues GPU kernel. Many limitations due to grid size constraints!

* pooling-naive(17) [pooling][forward] Naive solver: Switch back to trivial grid, {N,C,D} instead of {N,C,H}. Explain some grid-related design choices. Some refactoring.

* pooling-naive(18) [pooling][forward] Naive solver: Prevent UB when grid exceeds OCL limits

* pooling-naive(19) [pooling][forward] FIX: Remove useless info (dx, dy) from NetworkConfig for Forward

* pooling-naive(20) [pooling][forward] Compute grid and workgroup. CPU emulation: Emulate grid.

* pooling-naive(21) [pooling][forward] Kernel: Engage optimized grid and workgroup sizes.

* pooling-naive(22) [pooling][forward] Reorder arguments and unify their names.

* pooling-naive(23) [pooling][forward] Naive solver: Remove emulation on CPU.

* pooling-naive(24) [pooling][forward] Take the new solver into use.

* pooling-naive(25) [pooling][forward[3D] PoolingForwardNd: Fixed correctness for FP16 (issue #2160).
[tests] Added regression test for issue #2160.
[tests] Added regression test for issue #2110 (1).
[tests] Fixed issues in test_miopendriver_regression_float and test_miopendriver_regression_half introduced in #2118.
[tests] Introduced WORKAROUND_ISSUE_2110_2.

* pooling-naive(26) [pooling][forward_3d] PoolingForwardNd: implement size checks. This enables PoolingForwardNaive for 3D when limits of PoolingForwardNd are hit.

* pooling-naive(27) Fix tidy errors

* pooling-naive(28) Added 2D/3D regression tests for issue #2110 (1).

* pooling-naive(29) Partially revert: pooling-naive(18) [pooling][forward] Naive solver: Prevent UB when grid exceeds OCL limits

* pooling-naive(30) Fix tidy error
gonidelis pushed a commit that referenced this pull request Jun 20, 2023
* pooling-naive(01) [NFC] Added MaxLocalMemorySize to target properties

* pooling-naive(02) [NFC][pooling][backward] Refactor to compute some kernel params only once when LDS size check is implemented.

* pooling-naive(03) [pooling][backward2d] Implement LDS (local memory) size check

* pooling-naive(04) [pooling][forward_2d] Small fixes: alignment of private arrays and logging.

* pooling-naive(05) [NFC] Reformat comments.

* pooling-naive(06) [NFC][driver] Added const to argument of GetTensor... functions.

* pooling-naive(07) [pooling][forward] Add naive solver with temporary host-side-only implementation.

* pooling-naive(08) [pooling] FIX: Remove implementation specific details from NetworkConfig (a key of InvokerCache). These details lead correctness issues with other implementations due to usage of wrong invokers.

* pooling-naive(09) [pooling][forward] Naive solver: move computations from kernel to solver.

* pooling-naive(10) [pooling][forward] Naive solver: reorg host-only code to better match device execution model.

* pooling-naive(11) [tests][pooling] Enable index verification in Full tests (--all).

* pooling-naive(12) [pooling][forward] Naive solver: Mark it dynamic.

* pooling-naive(13) [pooling][forward] Naive host-only solver: Tidy fixes. Developed and applied some strategy to prevent overflows. Unified names of kernel args. Partially implemented host code for running the kernel on GPU.

* pooling-naive(14) [pooling][forward] Naive solver: Added OCL kernel.

* pooling-naive(15) [pooling][forward] Naive solver: Convert pooling op from argument to compile-time parameter.

* pooling-naive(16) [pooling][forward] Naive solver: Fixed kernel build errors and warnings. Finished code that enqueues GPU kernel. Many limitations due to grid size constraints!

* pooling-naive(17) [pooling][forward] Naive solver: Switch back to trivial grid, {N,C,D} instead of {N,C,H}. Explain some grid-related design choices. Some refactoring.

* pooling-naive(18) [pooling][forward] Naive solver: Prevent UB when grid exceeds OCL limits

* pooling-naive(19) [pooling][forward] FIX: Remove useless info (dx, dy) from NetworkConfig for Forward

* pooling-naive(20) [pooling][forward] Compute grid and workgroup. CPU emulation: Emulate grid.

* pooling-naive(21) [pooling][forward] Kernel: Engage optimized grid and workgroup sizes.

* pooling-naive(22) [pooling][forward] Reorder arguments and unify their names.

* pooling-naive(23) [pooling][forward] Naive solver: Remove emulation on CPU.

* pooling-naive(24) [pooling][forward] Take the new solver into use.

* pooling-naive(25) [pooling][forward[3D] PoolingForwardNd: Fixed correctness for FP16 (issue #2160).
[tests] Added regression test for issue #2160.
[tests] Added regression test for issue #2110 (1).
[tests] Fixed issues in test_miopendriver_regression_float and test_miopendriver_regression_half introduced in #2118.
[tests] Introduced WORKAROUND_ISSUE_2110_2.

* pooling-naive(26) [pooling][forward_3d] PoolingForwardNd: implement size checks. This enables PoolingForwardNaive for 3D when limits of PoolingForwardNd are hit.

* pooling-naive(27) Fix tidy errors

* pooling-naive(28) Added 2D/3D regression tests for issue #2110 (1).

* pooling-naive(29) Partially revert: pooling-naive(18) [pooling][forward] Naive solver: Prevent UB when grid exceeds OCL limits

* pooling-naive(30) Fix tidy error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants