Skip to content

Merging from staging to master.#759

Merged
junliume merged 27 commits into
amd-masterfrom
amd-develop
Jun 19, 2023
Merged

Merging from staging to master.#759
junliume merged 27 commits into
amd-masterfrom
amd-develop

Conversation

@illsilin
Copy link
Copy Markdown
Collaborator

  • Enable gemm_dl and other kernels on Navi3x. (Enable gemm_dl and other kernels on Navi3x. #714)

  • enable dl kernels on navi3

  • do not build xdl tests and examples on Navi

  • run tests before building everything on jenkins

  • disable gemm_bilinear on gfx1030

  • add gpu targets to installer on Navi

  • put tests in the same order as before

  • reduce the number of navi targets in CI

  • build CI installed for gfx940 as well

  • only build for MI300 during QA runs

  • Pool3d fwd (Pool3d fwd #697)

  • Expand the base class of pool2d, prepare to share base class with pool3d

  • Add pool3d device op

  • Add pool3d f16 example

  • Refactor the base class. implement generic pooling in the future

  • clang format

  • get original index in max pooling

  • Add outputindex to base class

  • Fix dimension

  • Add pooling instance

  • Use indexType instead

  • Remove useless header

  • Extract IndexDataType to template

  • Extract pooling reference code

  • clang format

  • clang format

  • Fix typo

  • Add tensor stride

  • Add missing header

  • Add index stride and output stride

  • Refine naming

  • Add type to base class

  • Rename file

  • Use proper size

  • Fix typo

  • Refine naming

  • Modify the argument into vector.

  • Add max pool profiler

  • Refine naming

  • Support f32 pool

  • Fix typo

  • Add avg pool2d fwd in profiler

  • clang format

  • Rename AccDatatype to ComputeDatatype

  • Fix init

  • test pool

  • Extract variable

  • Add client example

  • Check the pooling dim

  • clang format

  • Connect argv and arg_parser

  • Add found check

  • Remove useless header

  • Refine naming

  • Adjust the order of device_pool_fwd

  • Clean-up the headers (Clean-up the headers #713)

  • fix headers for gpu instances

  • remove unused headers


  • Add instances for fp16/int8 Gemm kernels (Navi21) (Add instances for fp16/int8 Gemm kernels (Navi21) #717)

  • Add instances for fp16/int8 Gemm kernels (Navi21)

  • Extend instances with smaller tiles

  • Fix SrcVectorTensor for km_kn_mn int8

  • Multiple fixes to GroupedGemm+SplitK (Multiple fixes to GroupedGemm+SplitK #707)

  • Add license header.

  • Reduce number of logged output. Add constant initialization.

  • Add functional tests for grouped_gemm with different kbatch value.

  • Add debug log informations + remove unused code.

  • Don't pass kbatch to CalculateKPadded.

  • Turn on logging in grouped gemm and gemm splitk profiler

  • Debug: limit number of test cases to run;

  • Log more information and initialize with constant value.

  • Turn on DEBUG_LOG

  • Add more debug log informations.

  • Limit the number of instances to compile.

  • Use GridwiseGemmPipeline

  • Use KBatch to calculate K0

  • Multiple DebugLog messages.

  • Unit tests for multiple KBatch values.

  • Refactoring

  • Disable logging

  • extract out of if statement KBatch update.

  • Uncomment instances.

  • Disable DebugLog.

  • Use Kbatch when calculate KPadded.

  • Fix CGridDesc padding.

  • Use available helper functions.

  • Uncomment code commented for debuggin.

  • Remove unnecessary debug log messages.

  • Uncomment previously commented code for debug purposes.

  • Add KBatch info to profiler output summary log.

  • Add gtests for gemm splitk using ckProfiler API.

  • Add more test-cases for different data layout.

  • Add more test cases for gemm splitk

  • Remove old test.

  • Unit tests for MKNK ggemm interface.

  • Fix and add more unit-tests.

  • Constepxr everything!

  • Increase error threshold for fp16 and splitk.

Since we're using fp16 atomic add for splitk there's a known precision loss.


  • Simplify kernel argument of device operator DeviceGemm_Xdl_CShuffle<> (Simplify kernel argument of device operator DeviceGemm_Xdl_CShuffle<> #696)

  • Remove M/N/KPad local variables

  • Use M/N/KPad to name padded lengths

  • Replace duplicated local variable by parameters

  • Rename variables M/N/KRaw to M/N/K

  • Move AK0/BK0 compute logic into GridwiseGemm

  • Use macro to shorten code

  • Move CalculateGridSize() logic into GridwiseGemm

  • Add comment to credit the implementation source

  • Reuse the existing implementation

  • Remove no-longer used data members

  • Remove elementwise-op objects from interfaces

  • Reserve kernel arg as whole object in interfaces

  • Remove redundant data member

  • Make 3rd type parameter optional

  • Remove unnesscary type parameters

  • Remove no-longer used descriptor-creation methods

  • Move kernel arg type definition into GridwiseGemm

  • Add macro to switch between code sections

  • Move argument field computing logic into device op side

  • Make utility method 'static'

  • Declare special methods

  • Unify MakeArgument() usage

  • Adapt the new GridwiseGemm interface

  • Push-down class 'GridwiseGemm::Argument' fields

  • Remove no-longer used methods

  • Add unused parameters

  • Force copying parameters in 'Embed' ctor

  • Remove no-longer used descriptors

  • Fallback change on BaseArgument

  • Remove macro 'INTEGER_DIVIDE_CEIL'

  • Make variable naming more consistent

  • Make sure methods are only invoked on right place

  • Remove tailing underscore in public attribute name

  • Remove necessary methods

  • Hide computing logic of derived attributes

  • Make new 'Embed' ctor only available for device code

  • Make sure 'Embed' type args are not references

  • Move check for karg.K into CheckValidity()

  • Remove more integer division logic form device code

  • Undo changes on Embed

  • Separate 'Problem' concept out from 'Argument'

  • Share same name for kernel interfaces

  • Reject unsupported argument


  • fix wmma gemm int8; add grouped conv int8 example ([Navi3x bug fix] fix wmma gemm int8; add grouped conv int8 example #716)

  • Add class type support for __builtin_amdgcn_readfirstlane() (Add class type support for __builtin_amdgcn_readfirstlane() #711)

  • Add overloaded version of __builtin_amdgcn_readfirstlane()

  • Remove 'static' specifiers

  • Remove more 'static' specifier

  • Replace unsigne char by std::byte

  • Add 'const' specifier to never changing variable

  • Add 'inline' specifier to funcion definition

  • Fix wrong boundar calculation logic

  • Rename type trait

  • Remove std:: qualifier from standard types

  • Replace 'size_t' by 'unsigned'

  • Use type alias to hint usage

  • Replace static_for<> by ordinary 'for' loop

  • Rename readfirstlane() to amd_wave_read_first_lane()

  • Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

  • Reorder statements

  • update copyright headers (update copyright headers #726)

  • Simplify kernel argument of device operator Device(Batched)GemmXdl<> (Simplify kernel argument of device operator Device(Batched)GemmXdl<> #723)

  • Remove M/N/KPad local variables

  • Use M/N/KPad to name padded lengths

  • Replace duplicated local variable by parameters

  • Rename variables M/N/KRaw to M/N/K

  • Move AK0/BK0 compute logic into GridwiseGemm

  • Use macro to shorten code

  • Move CalculateGridSize() logic into GridwiseGemm

  • Add comment to credit the implementation source

  • Reuse the existing implementation

  • Remove no-longer used data members

  • Remove elementwise-op objects from interfaces

  • Reserve kernel arg as whole object in interfaces

  • Remove redundant data member

  • Make 3rd type parameter optional

  • Remove unnesscary type parameters

  • Remove no-longer used descriptor-creation methods

  • Move kernel arg type definition into GridwiseGemm

  • Add macro to switch between code sections

  • Move argument field computing logic into device op side

  • Make utility method 'static'

  • Declare special methods

  • Unify MakeArgument() usage

  • Adapt the new GridwiseGemm interface

  • Push-down class 'GridwiseGemm::Argument' fields

  • Remove no-longer used methods

  • Add unused parameters

  • Force copying parameters in 'Embed' ctor

  • Remove no-longer used descriptors

  • Fallback change on BaseArgument

  • Remove macro 'INTEGER_DIVIDE_CEIL'

  • Make variable naming more consistent

  • Make sure methods are only invoked on right place

  • Remove tailing underscore in public attribute name

  • Remove necessary methods

  • Hide computing logic of derived attributes

  • Make new 'Embed' ctor only available for device code

  • Make sure 'Embed' type args are not references

  • Move check for karg.K into CheckValidity()

  • Remove more integer division logic form device code

  • Undo changes on Embed

  • Separate 'Problem' concept out from 'Argument'

  • Add overloaded version of __builtin_amdgcn_readfirstlane()

  • Remove 'static' specifiers

  • Remove more 'static' specifier

  • Replace unsigne char by std::byte

  • Add 'const' specifier to never changing variable

  • Add 'inline' specifier to funcion definition

  • Share same name for kernel interfaces

  • Fix wrong boundar calculation logic

  • Leave the third template arg for compatibility

  • Remove unnecessary parameters

  • Fix wrong error message (for type name)

  • Create descriptor on device side

  • Fix wrong debug message

  • Remove no-longer used data members

  • Rename type trait

  • Remove std:: qualifier from standard types

  • Replace 'size_t' by 'unsigned'

  • Use type alias to hint usage

  • Replace static_for<> by ordinary 'for' loop

  • Reject unsupported argument

  • Rename readfirstlane() to amd_wave_read_first_lane()

  • Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

  • Update function calls

  • Reorder statements

  • Re-format files




  • Enable gfx941 and gfx942 architectures. (Enable gfx941 and gfx942 architectures. #752)

  • enable gfx941/942 targets

  • fix clang format

  • fix the cmake logic for multiple targets

  • fix cmake syntax for looping over targets

  • add gfx941/942 support for gemm_xdl instances


illsilin and others added 23 commits May 23, 2023 11:23
* enable dl kernels on navi3

* do not build xdl tests and examples on Navi

* run tests before building everything on jenkins

* disable gemm_bilinear on gfx1030

* add gpu targets to installer on Navi

* put tests in the same order as before

* reduce the number of navi targets in CI

* build CI installed for gfx940 as well

* only build for MI300 during QA runs
* Expand the base class of pool2d, prepare to share base class with pool3d

* Add pool3d device op

* Add pool3d f16 example

* Refactor the base class. implement generic pooling in the future

* clang format

* get original index in max pooling

* Add outputindex to base class

* Fix dimension

* Add pooling instance

* Use indexType instead

* Remove useless header

* Extract IndexDataType to template

* Extract pooling reference code

* clang format

* clang format

* Fix typo

* Add tensor stride

* Add missing header

* Add index stride and output stride

* Refine naming

* Add type to base class

* Rename file

* Use proper size

* Fix typo

* Refine naming

* Modify the argument into vector.

* Add max pool profiler

* Refine naming

* Support f32 pool

* Fix typo

* Add avg pool2d fwd in profiler

* clang format

* Rename AccDatatype to ComputeDatatype

* Fix init

* test pool

* Extract variable

* Add client example

* Check the pooling dim

* clang format

* Connect argv and arg_parser

* Add found check

* Remove useless header

* Refine naming

* Adjust the order of device_pool_fwd
* fix headers for gpu instances

* remove unused headers

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add instances for fp16/int8 Gemm kernels (Navi21)

* Extend instances with smaller tiles

* Fix SrcVectorTensor for km_kn_mn int8
* Add license header.

* Reduce number of logged output. Add constant initialization.

* Add functional tests for grouped_gemm with different kbatch value.

* Add debug log informations + remove unused code.

* Don't pass kbatch to CalculateKPadded.

* Turn on logging in grouped gemm and gemm splitk profiler

* Debug: limit number of test cases to run;

* Log more information and initialize with constant value.

* Turn on DEBUG_LOG

* Add more debug log informations.

* Limit the number of instances to compile.

* Use GridwiseGemmPipeline

* Use KBatch to calculate K0

* Multiple DebugLog messages.

* Unit tests for multiple KBatch values.

* Refactoring

* Disable logging
* extract out of if statement KBatch update.

* Uncomment instances.

* Disable DebugLog.

* Use Kbatch when calculate KPadded.

* Fix CGridDesc padding.

* Use available helper functions.

* Uncomment code commented for debuggin.

* Remove unnecessary debug log messages.

* Uncomment previously commented code for debug purposes.

* Add KBatch info to profiler output summary log.

* Add gtests for gemm splitk using ckProfiler API.

* Add more test-cases for different data layout.

* Add more test cases for gemm splitk

* Remove old test.

* Unit tests for MKNK ggemm interface.

* Fix and add more unit-tests.

* Constepxr everything!

* Increase error threshold for fp16 and splitk.

Since we're using fp16 atomic add for splitk there's a
known precision loss.

---------

Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
…#696)

* Remove M/N/KPad local variables

* Use M/N/KPad to name padded lengths

* Replace duplicated local variable by parameters

* Rename variables M/N/KRaw to M/N/K

* Move AK0/BK0 compute logic into GridwiseGemm

* Use macro to shorten code

* Move CalculateGridSize() logic into GridwiseGemm

* Add comment to credit the implementation source

* Reuse the existing implementation

* Remove no-longer used data members

* Remove elementwise-op objects from interfaces

* Reserve kernel arg as whole object in interfaces

* Remove redundant data member

* Make 3rd type parameter optional

* Remove unnesscary type parameters

* Remove no-longer used descriptor-creation methods

* Move kernel arg type definition into GridwiseGemm

* Add macro to switch between code sections

* Move argument field computing logic into device op side

* Make utility method 'static'

* Declare special methods

* Unify MakeArgument() usage

* Adapt the new GridwiseGemm interface

* Push-down class 'GridwiseGemm::Argument' fields

* Remove no-longer used methods

* Add unused parameters

* Force copying parameters in 'Embed' ctor

* Remove no-longer used descriptors

* Fallback change on BaseArgument

* Remove macro 'INTEGER_DIVIDE_CEIL'

* Make variable naming more consistent

* Make sure methods are only invoked on right place

* Remove tailing underscore in public attribute name

* Remove necessary methods

* Hide computing logic of derived attributes

* Make new 'Embed' ctor only available for device code

* Make sure 'Embed' type args are not references

* Move check for karg.K into CheckValidity()

* Remove more integer division logic form device code

* Undo changes on Embed

* Separate 'Problem' concept out from 'Argument'

* Share same name for kernel interfaces

* Reject unsupported argument

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigne char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to funcion definition

* Fix wrong boundar calculation logic

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

* Reorder statements
…723)

* Remove M/N/KPad local variables

* Use M/N/KPad to name padded lengths

* Replace duplicated local variable by parameters

* Rename variables M/N/KRaw to M/N/K

* Move AK0/BK0 compute logic into GridwiseGemm

* Use macro to shorten code

* Move CalculateGridSize() logic into GridwiseGemm

* Add comment to credit the implementation source

* Reuse the existing implementation

* Remove no-longer used data members

* Remove elementwise-op objects from interfaces

* Reserve kernel arg as whole object in interfaces

* Remove redundant data member

* Make 3rd type parameter optional

* Remove unnesscary type parameters

* Remove no-longer used descriptor-creation methods

* Move kernel arg type definition into GridwiseGemm

* Add macro to switch between code sections

* Move argument field computing logic into device op side

* Make utility method 'static'

* Declare special methods

* Unify MakeArgument() usage

* Adapt the new GridwiseGemm interface

* Push-down class 'GridwiseGemm::Argument' fields

* Remove no-longer used methods

* Add unused parameters

* Force copying parameters in 'Embed' ctor

* Remove no-longer used descriptors

* Fallback change on BaseArgument

* Remove macro 'INTEGER_DIVIDE_CEIL'

* Make variable naming more consistent

* Make sure methods are only invoked on right place

* Remove tailing underscore in public attribute name

* Remove necessary methods

* Hide computing logic of derived attributes

* Make new 'Embed' ctor only available for device code

* Make sure 'Embed' type args are not references

* Move check for karg.K into CheckValidity()

* Remove more integer division logic form device code

* Undo changes on Embed

* Separate 'Problem' concept out from 'Argument'

* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigne char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to funcion definition

* Share same name for kernel interfaces

* Fix wrong boundar calculation logic

* Leave the third template arg for compatibility

* Remove unnecessary parameters

* Fix wrong error message (for type name)

* Create descriptor on device side

* Fix wrong debug message

* Remove no-longer used data members

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Reject unsupported argument

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

* Update function calls

* Reorder statements

* Re-format files

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
* update dockerfile to build rocm5.6 rc3

* fix couple of docker issues
* add check input parameter

* add instance for vector load = 1

* move gerneral instance to first pos

* fix read bias code

* regular code for bias load

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
…ane() (#738)

* Fix wrong pointer type

* Rename type trait get_unsigned_int<> to get_carrier<>

* Add 3-bytes carrier type

* Add missing __device__ specifier

* Rename template non-type parameter

* Leave the rest byte uninitialized

* Avoid invoking (host) STL algorithms

* Remove unnecessary 'inline' specifier

* Extract common logic out as helper method

* Hide dummy member function

* Add missing __device__ specifier
* Add DeviceBatchedGemmMultipleD_Dl

* Fix batched_gemm tests

* Fix comments

* test_batched_gemm_multi_d fixes

* Fix args for isSupported batchedGemmMultipleDDl

* Disable tests for gfx90a
…GemmMultipleD (#741)

* Add generic instance gemm_add_add_fastgelu

* Add a client example for generic gemm_add_add_fastgelu

* Update CMakeLists

* Format

* Format

* Add generic instance gemm_add_fastgelu

* Format

* Add a gemm_add_fastgelu client example

* Format

* Add generic instance gemm_fastgelu

* Format

* Fix argument order

* Add gemm_fastgelu client example

* Add exceptions if argument is not supported
* fix CI builds with latest staging compiler

* remove mount flags from dockerfile
* Add getAvailableComputeUnitCount() interface

* Use available number of compute units to set kernel grid size
* Changed wei layout

* changed layout for examples

* fixed client example

---------

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* enable gfx941/942 targets

* fix clang format

* fix the cmake logic for multiple targets

* fix cmake syntax for looping over targets

* add gfx941/942 support for gemm_xdl instances
@illsilin illsilin requested review from asroy and zjing14 June 16, 2023 16:01
illsilin and others added 2 commits June 16, 2023 12:13
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting

* Move the generic kernel instance to be the first of the instance list for elementwise op of normalization

* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax

* Add testing of GetGenericInstance() in client_example of Softmax

* Revert "Add testing of GetGenericInstance() in client_example of Softmax"

This reverts commit f629cd9.

* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"

This reverts commit a9f0d00.

* Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm

* Move generic kernel instance to separate tuple for elementwise op of normalization

* Remove un-used files for softmax instance

* Store generic kernel instance to separate tuple for softmax

* Add IsSupported checking for generic instance to client example of softmax

* Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization

* clang-format fix

* Remove int8 from softmax instances

---------

Co-authored-by: zjing14 <zhangjing14@gmail.com>
rocking5566 and others added 2 commits June 19, 2023 09:44
* Add maxpool f32 kernel and example

* Revise copyright

* Add device pool bwd device op

* Support f16 and bf16

* Add compute datatype for reference code.
Prevent error in bf16

* Fix type error

* Remove layout

* Fix bf16 error

* Add f16 and bf16 example

* Add more operations

* Implement IsSupportedArgument

* Add changelog

* Add comment

* Add comment

* Remove useless header

* Move initialize of workspace to the run

* Move set din zero to the device operator

* Save din_length_raw

* Remove useless header

* Calculate gridsize according to the number of CU

* Calculate gridSize according to the number of CU.
Remove useless header

* Add put example

* Remove useless header

* Fix CI fail
@junliume junliume merged commit fa9da1a into amd-master Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.