Merging from staging to master.#759
Merged
Merged
Conversation
* enable dl kernels on navi3 * do not build xdl tests and examples on Navi * run tests before building everything on jenkins * disable gemm_bilinear on gfx1030 * add gpu targets to installer on Navi * put tests in the same order as before * reduce the number of navi targets in CI * build CI installed for gfx940 as well * only build for MI300 during QA runs
* Expand the base class of pool2d, prepare to share base class with pool3d * Add pool3d device op * Add pool3d f16 example * Refactor the base class. implement generic pooling in the future * clang format * get original index in max pooling * Add outputindex to base class * Fix dimension * Add pooling instance * Use indexType instead * Remove useless header * Extract IndexDataType to template * Extract pooling reference code * clang format * clang format * Fix typo * Add tensor stride * Add missing header * Add index stride and output stride * Refine naming * Add type to base class * Rename file * Use proper size * Fix typo * Refine naming * Modify the argument into vector. * Add max pool profiler * Refine naming * Support f32 pool * Fix typo * Add avg pool2d fwd in profiler * clang format * Rename AccDatatype to ComputeDatatype * Fix init * test pool * Extract variable * Add client example * Check the pooling dim * clang format * Connect argv and arg_parser * Add found check * Remove useless header * Refine naming * Adjust the order of device_pool_fwd
* fix headers for gpu instances * remove unused headers --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add instances for fp16/int8 Gemm kernels (Navi21) * Extend instances with smaller tiles * Fix SrcVectorTensor for km_kn_mn int8
* Add license header. * Reduce number of logged output. Add constant initialization. * Add functional tests for grouped_gemm with different kbatch value. * Add debug log informations + remove unused code. * Don't pass kbatch to CalculateKPadded. * Turn on logging in grouped gemm and gemm splitk profiler * Debug: limit number of test cases to run; * Log more information and initialize with constant value. * Turn on DEBUG_LOG * Add more debug log informations. * Limit the number of instances to compile. * Use GridwiseGemmPipeline * Use KBatch to calculate K0 * Multiple DebugLog messages. * Unit tests for multiple KBatch values. * Refactoring * Disable logging * extract out of if statement KBatch update. * Uncomment instances. * Disable DebugLog. * Use Kbatch when calculate KPadded. * Fix CGridDesc padding. * Use available helper functions. * Uncomment code commented for debuggin. * Remove unnecessary debug log messages. * Uncomment previously commented code for debug purposes. * Add KBatch info to profiler output summary log. * Add gtests for gemm splitk using ckProfiler API. * Add more test-cases for different data layout. * Add more test cases for gemm splitk * Remove old test. * Unit tests for MKNK ggemm interface. * Fix and add more unit-tests. * Constepxr everything! * Increase error threshold for fp16 and splitk. Since we're using fp16 atomic add for splitk there's a known precision loss. --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>
…#696) * Remove M/N/KPad local variables * Use M/N/KPad to name padded lengths * Replace duplicated local variable by parameters * Rename variables M/N/KRaw to M/N/K * Move AK0/BK0 compute logic into GridwiseGemm * Use macro to shorten code * Move CalculateGridSize() logic into GridwiseGemm * Add comment to credit the implementation source * Reuse the existing implementation * Remove no-longer used data members * Remove elementwise-op objects from interfaces * Reserve kernel arg as whole object in interfaces * Remove redundant data member * Make 3rd type parameter optional * Remove unnesscary type parameters * Remove no-longer used descriptor-creation methods * Move kernel arg type definition into GridwiseGemm * Add macro to switch between code sections * Move argument field computing logic into device op side * Make utility method 'static' * Declare special methods * Unify MakeArgument() usage * Adapt the new GridwiseGemm interface * Push-down class 'GridwiseGemm::Argument' fields * Remove no-longer used methods * Add unused parameters * Force copying parameters in 'Embed' ctor * Remove no-longer used descriptors * Fallback change on BaseArgument * Remove macro 'INTEGER_DIVIDE_CEIL' * Make variable naming more consistent * Make sure methods are only invoked on right place * Remove tailing underscore in public attribute name * Remove necessary methods * Hide computing logic of derived attributes * Make new 'Embed' ctor only available for device code * Make sure 'Embed' type args are not references * Move check for karg.K into CheckValidity() * Remove more integer division logic form device code * Undo changes on Embed * Separate 'Problem' concept out from 'Argument' * Share same name for kernel interfaces * Reject unsupported argument --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add overloaded version of __builtin_amdgcn_readfirstlane() * Remove 'static' specifiers * Remove more 'static' specifier * Replace unsigne char by std::byte * Add 'const' specifier to never changing variable * Add 'inline' specifier to funcion definition * Fix wrong boundar calculation logic * Rename type trait * Remove std:: qualifier from standard types * Replace 'size_t' by 'unsigned' * Use type alias to hint usage * Replace static_for<> by ordinary 'for' loop * Rename readfirstlane() to amd_wave_read_first_lane() * Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp * Reorder statements
…723) * Remove M/N/KPad local variables * Use M/N/KPad to name padded lengths * Replace duplicated local variable by parameters * Rename variables M/N/KRaw to M/N/K * Move AK0/BK0 compute logic into GridwiseGemm * Use macro to shorten code * Move CalculateGridSize() logic into GridwiseGemm * Add comment to credit the implementation source * Reuse the existing implementation * Remove no-longer used data members * Remove elementwise-op objects from interfaces * Reserve kernel arg as whole object in interfaces * Remove redundant data member * Make 3rd type parameter optional * Remove unnesscary type parameters * Remove no-longer used descriptor-creation methods * Move kernel arg type definition into GridwiseGemm * Add macro to switch between code sections * Move argument field computing logic into device op side * Make utility method 'static' * Declare special methods * Unify MakeArgument() usage * Adapt the new GridwiseGemm interface * Push-down class 'GridwiseGemm::Argument' fields * Remove no-longer used methods * Add unused parameters * Force copying parameters in 'Embed' ctor * Remove no-longer used descriptors * Fallback change on BaseArgument * Remove macro 'INTEGER_DIVIDE_CEIL' * Make variable naming more consistent * Make sure methods are only invoked on right place * Remove tailing underscore in public attribute name * Remove necessary methods * Hide computing logic of derived attributes * Make new 'Embed' ctor only available for device code * Make sure 'Embed' type args are not references * Move check for karg.K into CheckValidity() * Remove more integer division logic form device code * Undo changes on Embed * Separate 'Problem' concept out from 'Argument' * Add overloaded version of __builtin_amdgcn_readfirstlane() * Remove 'static' specifiers * Remove more 'static' specifier * Replace unsigne char by std::byte * Add 'const' specifier to never changing variable * Add 'inline' specifier to funcion definition * Share same name for kernel interfaces * Fix wrong boundar calculation logic * Leave the third template arg for compatibility * Remove unnecessary parameters * Fix wrong error message (for type name) * Create descriptor on device side * Fix wrong debug message * Remove no-longer used data members * Rename type trait * Remove std:: qualifier from standard types * Replace 'size_t' by 'unsigned' * Use type alias to hint usage * Replace static_for<> by ordinary 'for' loop * Reject unsupported argument * Rename readfirstlane() to amd_wave_read_first_lane() * Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp * Update function calls * Reorder statements * Re-format files --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>
* update dockerfile to build rocm5.6 rc3 * fix couple of docker issues
* add check input parameter * add instance for vector load = 1 * move gerneral instance to first pos * fix read bias code * regular code for bias load --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>
…ane() (#738) * Fix wrong pointer type * Rename type trait get_unsigned_int<> to get_carrier<> * Add 3-bytes carrier type * Add missing __device__ specifier * Rename template non-type parameter * Leave the rest byte uninitialized * Avoid invoking (host) STL algorithms * Remove unnecessary 'inline' specifier * Extract common logic out as helper method * Hide dummy member function * Add missing __device__ specifier
* Add DeviceBatchedGemmMultipleD_Dl * Fix batched_gemm tests * Fix comments * test_batched_gemm_multi_d fixes * Fix args for isSupported batchedGemmMultipleDDl * Disable tests for gfx90a
…GemmMultipleD (#741) * Add generic instance gemm_add_add_fastgelu * Add a client example for generic gemm_add_add_fastgelu * Update CMakeLists * Format * Format * Add generic instance gemm_add_fastgelu * Format * Add a gemm_add_fastgelu client example * Format * Add generic instance gemm_fastgelu * Format * Fix argument order * Add gemm_fastgelu client example * Add exceptions if argument is not supported
* fix CI builds with latest staging compiler * remove mount flags from dockerfile
* Add getAvailableComputeUnitCount() interface * Use available number of compute units to set kernel grid size
* Changed wei layout * changed layout for examples * fixed client example --------- Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* enable gfx941/942 targets * fix clang format * fix the cmake logic for multiple targets * fix cmake syntax for looping over targets * add gfx941/942 support for gemm_xdl instances
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting * Move the generic kernel instance to be the first of the instance list for elementwise op of normalization * Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax * Add testing of GetGenericInstance() in client_example of Softmax * Revert "Add testing of GetGenericInstance() in client_example of Softmax" This reverts commit f629cd9. * Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax" This reverts commit a9f0d00. * Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm * Move generic kernel instance to separate tuple for elementwise op of normalization * Remove un-used files for softmax instance * Store generic kernel instance to separate tuple for softmax * Add IsSupported checking for generic instance to client example of softmax * Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization * clang-format fix * Remove int8 from softmax instances --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>
zjing14
approved these changes
Jun 19, 2023
* Add maxpool f32 kernel and example * Revise copyright * Add device pool bwd device op * Support f16 and bf16 * Add compute datatype for reference code. Prevent error in bf16 * Fix type error * Remove layout * Fix bf16 error * Add f16 and bf16 example * Add more operations * Implement IsSupportedArgument * Add changelog * Add comment * Add comment * Remove useless header * Move initialize of workspace to the run * Move set din zero to the device operator * Save din_length_raw * Remove useless header * Calculate gridsize according to the number of CU * Calculate gridSize according to the number of CU. Remove useless header * Add put example * Remove useless header * Fix CI fail
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Enable gemm_dl and other kernels on Navi3x. (Enable gemm_dl and other kernels on Navi3x. #714)
enable dl kernels on navi3
do not build xdl tests and examples on Navi
run tests before building everything on jenkins
disable gemm_bilinear on gfx1030
add gpu targets to installer on Navi
put tests in the same order as before
reduce the number of navi targets in CI
build CI installed for gfx940 as well
only build for MI300 during QA runs
Pool3d fwd (Pool3d fwd #697)
Expand the base class of pool2d, prepare to share base class with pool3d
Add pool3d device op
Add pool3d f16 example
Refactor the base class. implement generic pooling in the future
clang format
get original index in max pooling
Add outputindex to base class
Fix dimension
Add pooling instance
Use indexType instead
Remove useless header
Extract IndexDataType to template
Extract pooling reference code
clang format
clang format
Fix typo
Add tensor stride
Add missing header
Add index stride and output stride
Refine naming
Add type to base class
Rename file
Use proper size
Fix typo
Refine naming
Modify the argument into vector.
Add max pool profiler
Refine naming
Support f32 pool
Fix typo
Add avg pool2d fwd in profiler
clang format
Rename AccDatatype to ComputeDatatype
Fix init
test pool
Extract variable
Add client example
Check the pooling dim
clang format
Connect argv and arg_parser
Add found check
Remove useless header
Refine naming
Adjust the order of device_pool_fwd
Clean-up the headers (Clean-up the headers #713)
fix headers for gpu instances
remove unused headers
Add instances for fp16/int8 Gemm kernels (Navi21) (Add instances for fp16/int8 Gemm kernels (Navi21) #717)
Add instances for fp16/int8 Gemm kernels (Navi21)
Extend instances with smaller tiles
Fix SrcVectorTensor for km_kn_mn int8
Multiple fixes to GroupedGemm+SplitK (Multiple fixes to GroupedGemm+SplitK #707)
Add license header.
Reduce number of logged output. Add constant initialization.
Add functional tests for grouped_gemm with different kbatch value.
Add debug log informations + remove unused code.
Don't pass kbatch to CalculateKPadded.
Turn on logging in grouped gemm and gemm splitk profiler
Debug: limit number of test cases to run;
Log more information and initialize with constant value.
Turn on DEBUG_LOG
Add more debug log informations.
Limit the number of instances to compile.
Use GridwiseGemmPipeline
Use KBatch to calculate K0
Multiple DebugLog messages.
Unit tests for multiple KBatch values.
Refactoring
Disable logging
extract out of if statement KBatch update.
Uncomment instances.
Disable DebugLog.
Use Kbatch when calculate KPadded.
Fix CGridDesc padding.
Use available helper functions.
Uncomment code commented for debuggin.
Remove unnecessary debug log messages.
Uncomment previously commented code for debug purposes.
Add KBatch info to profiler output summary log.
Add gtests for gemm splitk using ckProfiler API.
Add more test-cases for different data layout.
Add more test cases for gemm splitk
Remove old test.
Unit tests for MKNK ggemm interface.
Fix and add more unit-tests.
Constepxr everything!
Increase error threshold for fp16 and splitk.
Since we're using fp16 atomic add for splitk there's a known precision loss.
Simplify kernel argument of device operator DeviceGemm_Xdl_CShuffle<> (Simplify kernel argument of device operator DeviceGemm_Xdl_CShuffle<> #696)
Remove M/N/KPad local variables
Use M/N/KPad to name padded lengths
Replace duplicated local variable by parameters
Rename variables M/N/KRaw to M/N/K
Move AK0/BK0 compute logic into GridwiseGemm
Use macro to shorten code
Move CalculateGridSize() logic into GridwiseGemm
Add comment to credit the implementation source
Reuse the existing implementation
Remove no-longer used data members
Remove elementwise-op objects from interfaces
Reserve kernel arg as whole object in interfaces
Remove redundant data member
Make 3rd type parameter optional
Remove unnesscary type parameters
Remove no-longer used descriptor-creation methods
Move kernel arg type definition into GridwiseGemm
Add macro to switch between code sections
Move argument field computing logic into device op side
Make utility method 'static'
Declare special methods
Unify MakeArgument() usage
Adapt the new GridwiseGemm interface
Push-down class 'GridwiseGemm::Argument' fields
Remove no-longer used methods
Add unused parameters
Force copying parameters in 'Embed' ctor
Remove no-longer used descriptors
Fallback change on BaseArgument
Remove macro 'INTEGER_DIVIDE_CEIL'
Make variable naming more consistent
Make sure methods are only invoked on right place
Remove tailing underscore in public attribute name
Remove necessary methods
Hide computing logic of derived attributes
Make new 'Embed' ctor only available for device code
Make sure 'Embed' type args are not references
Move check for karg.K into CheckValidity()
Remove more integer division logic form device code
Undo changes on Embed
Separate 'Problem' concept out from 'Argument'
Share same name for kernel interfaces
Reject unsupported argument
fix wmma gemm int8; add grouped conv int8 example ([Navi3x bug fix] fix wmma gemm int8; add grouped conv int8 example #716)
Add class type support for __builtin_amdgcn_readfirstlane() (Add class type support for __builtin_amdgcn_readfirstlane() #711)
Add overloaded version of __builtin_amdgcn_readfirstlane()
Remove 'static' specifiers
Remove more 'static' specifier
Replace unsigne char by std::byte
Add 'const' specifier to never changing variable
Add 'inline' specifier to funcion definition
Fix wrong boundar calculation logic
Rename type trait
Remove std:: qualifier from standard types
Replace 'size_t' by 'unsigned'
Use type alias to hint usage
Replace static_for<> by ordinary 'for' loop
Rename readfirstlane() to amd_wave_read_first_lane()
Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
Reorder statements
update copyright headers (update copyright headers #726)
Simplify kernel argument of device operator Device(Batched)GemmXdl<> (Simplify kernel argument of device operator Device(Batched)GemmXdl<> #723)
Remove M/N/KPad local variables
Use M/N/KPad to name padded lengths
Replace duplicated local variable by parameters
Rename variables M/N/KRaw to M/N/K
Move AK0/BK0 compute logic into GridwiseGemm
Use macro to shorten code
Move CalculateGridSize() logic into GridwiseGemm
Add comment to credit the implementation source
Reuse the existing implementation
Remove no-longer used data members
Remove elementwise-op objects from interfaces
Reserve kernel arg as whole object in interfaces
Remove redundant data member
Make 3rd type parameter optional
Remove unnesscary type parameters
Remove no-longer used descriptor-creation methods
Move kernel arg type definition into GridwiseGemm
Add macro to switch between code sections
Move argument field computing logic into device op side
Make utility method 'static'
Declare special methods
Unify MakeArgument() usage
Adapt the new GridwiseGemm interface
Push-down class 'GridwiseGemm::Argument' fields
Remove no-longer used methods
Add unused parameters
Force copying parameters in 'Embed' ctor
Remove no-longer used descriptors
Fallback change on BaseArgument
Remove macro 'INTEGER_DIVIDE_CEIL'
Make variable naming more consistent
Make sure methods are only invoked on right place
Remove tailing underscore in public attribute name
Remove necessary methods
Hide computing logic of derived attributes
Make new 'Embed' ctor only available for device code
Make sure 'Embed' type args are not references
Move check for karg.K into CheckValidity()
Remove more integer division logic form device code
Undo changes on Embed
Separate 'Problem' concept out from 'Argument'
Add overloaded version of __builtin_amdgcn_readfirstlane()
Remove 'static' specifiers
Remove more 'static' specifier
Replace unsigne char by std::byte
Add 'const' specifier to never changing variable
Add 'inline' specifier to funcion definition
Share same name for kernel interfaces
Fix wrong boundar calculation logic
Leave the third template arg for compatibility
Remove unnecessary parameters
Fix wrong error message (for type name)
Create descriptor on device side
Fix wrong debug message
Remove no-longer used data members
Rename type trait
Remove std:: qualifier from standard types
Replace 'size_t' by 'unsigned'
Use type alias to hint usage
Replace static_for<> by ordinary 'for' loop
Reject unsupported argument
Rename readfirstlane() to amd_wave_read_first_lane()
Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
Update function calls
Reorder statements
Re-format files
replace hipMemcpy with hipMemcpyWithStream (replace hipMemcpy with hipMemcpyWithStream #734)
fix clang format (fix clang format #740)
Update docker (Update docker #744)
update dockerfile to build rocm5.6 rc3
fix couple of docker issues
support dynamic buffer using memory coherence glc_slc bit from template (support dynamic buffer using memory coherence glc_slc bit from template #725)
Fix flash attn mask bug (Fix flash attn mask bug #733)
add check input parameter
add instance for vector load = 1
move gerneral instance to first pos
fix read bias code
regular code for bias load
Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() (Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() #738)
Fix wrong pointer type
Rename type trait get_unsigned_int<> to get_carrier<>
Add 3-bytes carrier type
Add missing device specifier
Rename template non-type parameter
Leave the rest byte uninitialized
Avoid invoking (host) STL algorithms
Remove unnecessary 'inline' specifier
Extract common logic out as helper method
Hide dummy member function
Add missing device specifier
Add DeviceBatchedGemmMultipleD_Dl (Add DeviceBatchedGemmMultipleD_Dl #732)
Add DeviceBatchedGemmMultipleD_Dl
Fix batched_gemm tests
Fix comments
test_batched_gemm_multi_d fixes
Fix args for isSupported batchedGemmMultipleDDl
Disable tests for gfx90a
Fix arg order (Minor bug - fix args #751)
Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD (Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD #741)
Add generic instance gemm_add_add_fastgelu
Add a client example for generic gemm_add_add_fastgelu
Update CMakeLists
Format
Format
Add generic instance gemm_add_fastgelu
Format
Add a gemm_add_fastgelu client example
Format
Add generic instance gemm_fastgelu
Format
Fix argument order
Add gemm_fastgelu client example
Add exceptions if argument is not supported
Fix the daily CI job with latest staging compiler. (Fix the daily CI job with latest staging compiler. #753)
fix CI builds with latest staging compiler
remove mount flags from dockerfile
Using number of compute units to set gridSize (Using number of compute units to set gridSize #754)
Add getAvailableComputeUnitCount() interface
Use available number of compute units to set kernel grid size
Fixed Weight layout of grouped_conv 3d fwd (Fixed Weight layout of grouped_conv 3d fwd #743)
Changed wei layout
changed layout for examples
fixed client example
Enable gfx941 and gfx942 architectures. (Enable gfx941 and gfx942 architectures. #752)
enable gfx941/942 targets
fix clang format
fix the cmake logic for multiple targets
fix cmake syntax for looping over targets
add gfx941/942 support for gemm_xdl instances