conv3d w/out 4GB limitation #60
Conversation
Need a ctest like this #58

@asroy Should we keep the conv3d separate from conv2d?
…ct to 2G limitation
…conv3d; formatting and cleanup
@asroy I've resolved the conflicts. The performance dropped a lot though.

Plan to review code with @asroy tomorrow

@asroy The performance degradation was caused by using an if-statement in the dynamic buffer to avoid invalid reads, so I reverted to the current state; the performance is now about 99% of conv2d when the same problem is solved. The branches in the ISA are caused by

Here's the buffer_load issue https://ontrack-internal.amd.com/browse/SWDEV-319513
DoMagicDivision(int32_t dividend_i32, uint32_t multiplier, uint32_t shift)
{
    uint32_t dividend_u32 = bit_cast<uint32_t>(dividend_i32);
    uint32_t tmp = static_cast<unsigned long long>(dividend_u32) * multiplier >> 32;
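For context, the multiply-shift ("magic number") division that this snippet relies on can be sketched as follows. This is my own illustration of the technique, not CK's API: `make_magic`/`magic_divide` and the 31-bit dividend restriction are assumptions chosen so a 64-bit product cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

struct MagicDiv
{
    uint64_t multiplier;
    uint32_t shift;
};

// Granlund-Montgomery style magic numbers for 31-bit dividends:
// pick s = ceil(log2(d)) and m = floor(2^(31+s) / d) + 1, then
// n / d == (n * m) >> (31 + s) for every 0 <= n < 2^31.
MagicDiv make_magic(uint32_t divisor)
{
    uint32_t s = 0;
    while((uint64_t{1} << s) < divisor)
        ++s; // s = ceil(log2(divisor))
    uint64_t m = ((uint64_t{1} << (31 + s)) / divisor) + 1;
    return {m, s};
}

uint32_t magic_divide(uint32_t dividend, MagicDiv md)
{
    // dividend < 2^31 and multiplier <= 2^32, so the product fits in 64 bits
    return static_cast<uint32_t>((dividend * md.multiplier) >> (31 + md.shift));
}
```

Precomputing the pair once per divisor turns every division in the index math into a multiply and a shift, which is the whole point on GPUs where integer division is expensive.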
    typename ConvDilations,
    typename InLeftPads,
    typename InRightPads>
void host_conv3d_ndhwc_kzyxc_ndhwk(const Tensor<TIn>& in,
This needs to be replaced by ReferenceConvFwd
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/3dbb8b4078e5f3bdf5e49aa7549a9c069cd14502/reference_operation/include/reference_conv_fwd.hpp#L38
once branch aosewski/conv_nd is merged
It looks weird to me that host::ReferenceConvFwd inherits from device::BaseOperator.
Yes, host::ReferenceConvFwd is an afterthought. We need to re-organize their dependency. You can create an issue and assign it to me.
@@ -0,0 +1,106 @@
#ifndef NAIVE_CONV_FWD_HPP
This needs to be wrapped inside a Device operator class.
Could you elaborate on this?
You can write a class DeviceConvolutionNaive, which will call this kernel underneath.
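To illustrate the suggestion, here is a rough sketch of the shape such a wrapper could take, loosely following the Argument/Invoker convention of CK device operators. Everything besides the name DeviceConvolutionNaive is hypothetical, and the host stand-in below replaces the real HIP kernel launch so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch only: the real class would launch naive_conv3d_fwd via
// launch_and_time_kernel(); member and parameter names are invented.
struct DeviceConvolutionNaive
{
    struct Argument
    {
        const float* p_in;
        const float* p_wei;
        float* p_out;
        int64_t out_element_count;
    };

    struct Invoker
    {
        // returns elapsed time in ms, mirroring launch_and_time_kernel()
        float Run(const Argument& arg) const
        {
            // placeholder for the GPU launch of naive_conv3d_fwd; the host
            // stand-in just zero-fills the output so the wrapper is exercisable
            for(int64_t i = 0; i < arg.out_element_count; ++i)
                arg.p_out[i] = 0.f;
            return 0.f;
        }
    };

    static Argument
    MakeArgument(const float* p_in, const float* p_wei, float* p_out, int64_t out_element_count)
    {
        return {p_in, p_wei, p_out, out_element_count};
    }

    static Invoker MakeInvoker() { return {}; }
};
```

Wrapping the raw kernel this way gives it the same MakeArgument/MakeInvoker surface as the other device operators, so callers don't need to know the launch details.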
    index_t GemmK1Value>
__host__ __device__ constexpr auto
transform_forward_convolution3d_into_gemm_v4r4r4_nhwc_kyxc_nhwk_pad_split_batch(
    // const TensorDescriptor<In...>& in_grid_desc_n_di_hi_wi_c,

    make_pass_through_transform(N),
    make_pass_through_transform(K1)),
    make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}, Sequence<3>{}),
    make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}));

    make_pass_through_transform(M),
    make_pass_through_transform(N)),
    make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}),
    make_tuple(Sequence<0>{}, Sequence<1>{}));

    typename CThreadTransferSrcDstAccessOrder,
    index_t CThreadTransferSrcDstVectorDim,
    index_t CThreadTransferDstScalarPerVector>
struct GridwiseBatchedGemm_bk0mk1_k0nk1_bmn_xdlops_v2r3
Inconsistent file name and struct name: v2r3 vs v2r3r3.
}

__host__ __device__ static constexpr auto
MakeAGridDescriptor_K0_M_K1(const AGridDesc_B_K0_M_K1& a_grid_desc_b_k0_m_k1, const int bb)
int and long long are not allowed in index calculation.
Please use index_t and long_index_t only.
template <long_index_t N>
using LongNumber = integral_constant<long_index_t, N>;
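To illustrate the reviewer's point, here is a minimal sketch showing why index math has to widen to the 64-bit type before multiplying. The typedef names follow the diff; flatten_offset and its strides are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// typedefs in the spirit of ck's: 32-bit index_t, 64-bit long_index_t
using index_t      = int32_t;
using long_index_t = int64_t;

// Hypothetical offset helper: cast to long_index_t BEFORE the multiply,
// otherwise n * stride_n is evaluated in 32 bits and overflows for
// tensors larger than 2^31 elements.
long_index_t flatten_offset(index_t n, index_t c, index_t stride_n)
{
    return static_cast<long_index_t>(n) * stride_n + c;
}
```

Sticking to index_t/long_index_t (rather than raw int/long long) also keeps the widening points explicit and greppable.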
template <typename Index0,
template <typename Index0,
Index0 X,
typename Index1,
Index1 Y>
__host__ __device__ constexpr auto operator+(integral_constant<Index0, X>,
integral_constant<Index1, Y>)
{
constexpr auto Z = X + Y;
return integral_constant<decltype(Z), Z>{};
}
That's indeed better. I moved the operators into integral_constant.hpp, leaving number.hpp almost empty. Should we also move the using Number alias along with them?
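For context, a self-contained sketch of the mixed-width addition the operator above enables. The integral_constant here is a simplified stand-in for CK's, not the real header, and the result type is spelled decltype(X + Y) so the usual arithmetic promotion decides it.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

using index_t      = int32_t;
using long_index_t = int64_t;

// simplified stand-in for ck's integral_constant
template <typename T, T v>
struct integral_constant
{
    static constexpr T value = v;
    using value_type         = T;
};

template <index_t N>
using Number = integral_constant<index_t, N>;

template <long_index_t N>
using LongNumber = integral_constant<long_index_t, N>;

// same idea as the operator in the thread: adding a Number and a LongNumber
// promotes to the wider type, so the sum is a LongNumber
template <typename Index0, Index0 X, typename Index1, Index1 Y>
constexpr auto operator+(integral_constant<Index0, X>, integral_constant<Index1, Y>)
{
    return integral_constant<decltype(X + Y), X + Y>{};
}
```

Keeping such operators next to integral_constant (as done in this PR) means Number and LongNumber arithmetic both pick them up without extra includes.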
float ave_time = launch_and_time_kernel(naive_conv3d_fwd,
                                        nrepeat,
                                        dim3(256),
Any reason to use a constant grid size of 256?
No specific reason. The implementation doesn't rely on block size or grid size, so I hard-coded them.
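If a derived grid size were ever wanted instead, a minimal sketch could look like this. The helper name, the one-thread-per-output-element mapping, and the 4096-block cap are all my own illustrative choices, not CK's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch: derive the grid size from the amount of work instead of a
// hard-coded dim3(256); block_size and max_blocks are illustrative values.
constexpr uint32_t block_size = 256;

uint32_t calc_grid_size(uint64_t total_work_items, uint32_t max_blocks = 4096)
{
    // one thread per work item, rounded up to whole blocks
    uint64_t blocks = (total_work_items + block_size - 1) / block_size;
    blocks          = std::max<uint64_t>(blocks, 1); // at least one block
    return static_cast<uint32_t>(std::min<uint64_t>(blocks, max_blocks));
}
```

Since the naive kernel is grid-stride over the work anyway, a hard-coded size is functionally fine; a derived size only matters for occupancy on small problems.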
in_left_pads[2]  = std::stoi(argv[21]);
in_right_pads[0] = std::stoi(argv[22]);
in_right_pads[1] = std::stoi(argv[23]);
in_right_pads[2] = std::stoi(argv[24]);
You can use getopt_long() to help parse and validate the 25 arguments.
Good to know. But since the other examples don't use it, I'll leave it as is. We can change them all together in the future if needed.
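For reference, a minimal getopt_long() sketch of the suggestion. The option names here are invented for illustration; the example currently takes 25 positional arguments, and mapping all of them would just extend the table.

```cpp
#include <cassert>
#include <cstdlib>
#include <getopt.h> // getopt_long is a GNU/POSIX.2 extension, available on Linux

// Hypothetical parser for two of the convolution parameters; returns 0 on
// success, -1 on an unrecognized option.
int parse_conv_args(int argc, char** argv, int* n, int* k)
{
    static const option long_opts[] = {{"batch", required_argument, nullptr, 'n'},
                                       {"out-channels", required_argument, nullptr, 'k'},
                                       {nullptr, 0, nullptr, 0}};
    optind = 1; // reset global getopt state so the parser is re-entrant
    int opt;
    while((opt = getopt_long(argc, argv, "n:k:", long_opts, nullptr)) != -1)
    {
        switch(opt)
        {
        case 'n': *n = std::atoi(optarg); break;
        case 'k': *k = std::atoi(optarg); break;
        default: return -1;
        }
    }
    return 0;
}
```

Named options would make the 25-argument command lines self-describing (--batch 4 instead of a bare 4 in position 2), at the cost of diverging from the other examples.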
See #94 instead.
3D convolution:

- Number64.
- The performance of conv3D is about 99% of conv2D if the same problem is solved.
- c_thread_buf initialization to GridwiseGemm_konk1_mn_xdlops_v2r3.
- DoMagicDivision, which was missing.
- DynamicBuffer::Run when CK_USE_AMD_BUFFER_ADDRESSING=1.
- buffer_load (amd_buffer_load_invalid_element_return_zero, to be precise) is problematic in 3D convolution.
- BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2