conv3d w/out 4GB limitation #60

Closed

j4yan wants to merge 33 commits into ROCm:develop from j4yan:conv3d_splitN_rebased


Conversation

j4yan (Contributor) commented Nov 28, 2021

3D convolution:

  • a batched implementation that splits the batch dimension into sub-batches, each of which stays within the int32 range; this comes with Number64. The performance of conv3d is about 99% of conv2d when solving the same problem (see the sketch after this list).
  • added c_thread_buf initialization to GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v2r3.
  • added the host version of DoMagicDivision, which was missing.
  • disabled buffer_load in DynamicBuffer::Run when CK_USE_AMD_BUFFER_ADDRESSING=1; buffer_load (amd_buffer_load_invalid_element_return_zero, to be precise) is problematic in 3D convolution.
  • fixed a bug in BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2.
  • added example conv3d_fwd_xdl.
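
To illustrate the splitting idea, here is a minimal sketch (the helper names and the driver loop are illustrative assumptions, not the PR's actual code): the sub-batch size is chosen so that no per-sub-batch tensor exceeds the int32 element range.

#include <algorithm>
#include <cstdint>

// Pick the largest sub-batch size such that the biggest tensor touched by one
// sub-batch (input or output; the weight is shared) still fits in int32 indexing.
inline int32_t ComputeSubBatchSize(int64_t n, int64_t in_elems_per_sample, int64_t out_elems_per_sample)
{
    const int64_t elems_per_sample = std::max(in_elems_per_sample, out_elems_per_sample);
    const int64_t max_sub_n        = std::max<int64_t>(INT32_MAX / elems_per_sample, 1);
    return static_cast<int32_t>(std::min<int64_t>(n, max_sub_n));
}

// The driver would then run the batched GEMM once per sub-batch:
//
//   for(int64_t n_start = 0; n_start < N; n_start += sub_n)
//   {
//       const int32_t cur_n = static_cast<int32_t>(std::min<int64_t>(sub_n, N - n_start));
//       // build grid descriptors for samples [n_start, n_start + cur_n) and launch the kernel
//   }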

j4yan requested review from asroy and zjing14 on November 28, 2021 20:11
asroy (Contributor) commented Nov 30, 2021

Need a ctest like the one in #58.

Comment thread on .gitignore (outdated)
j4yan changed the title from "Conv3d split n rebased" to "conv3d w/out 4GB limitation" on Dec 14, 2021
j4yan (Contributor, Author) commented Dec 14, 2021

@asroy @zjing14 I rebased the example conv3d_fwd_xdl against develop. Ready for review.

j4yan (Contributor, Author) commented Dec 14, 2021

@asroy Should we keep the conv3d separate from conv2d?

j4yan requested a review from asroy on December 18, 2021 05:37
j4yan (Contributor, Author) commented Jan 7, 2022

@asroy I've resolved the conflicts. The performance dropped a lot though.

zjing14 previously approved these changes Jan 11, 2022

zjing14 (Contributor) left a comment

LGTM

aserio commented Jan 14, 2022

Plan to review code with @asroy tomorrow

j4yan (Contributor, Author) commented Jan 17, 2022

@asroy The performance degradation was caused by using an if-statement in the dynamic buffer to avoid invalid reads, so I reverted to the current state; performance is now about 99% of conv2d when solving the same problem. The remaining branches in the ISA are caused by:

  • the outer loop over batches in GridwiseGemm
  • the ternary operator in the dynamic buffer when CK_EXPERIMENTAL_USE_MEMCPY_FOR_VECTOR_ACCESS=0 (see the sketch after this list)
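
For context, a minimal sketch of the two guard styles discussed here (illustrative only; this is not the actual DynamicBuffer code, and the names below are assumptions):

// Style 1: if-statement guard around the read. This is the variant that caused
// the performance drop; the compiler emits a branch around the load.
template <typename T>
__host__ __device__ T get_guarded_if(const T* p, int32_t i, int32_t element_space_size)
{
    T v{0};
    if(i < element_space_size)
    {
        v = p[i];
    }
    return v;
}

// Style 2: ternary-operator guard, the current state when
// CK_EXPERIMENTAL_USE_MEMCPY_FOR_VECTOR_ACCESS=0. Per the discussion above,
// this still shows up as a branch in the generated ISA.
template <typename T>
__host__ __device__ T get_guarded_ternary(const T* p, int32_t i, int32_t element_space_size)
{
    return (i < element_space_size) ? p[i] : T{0};
}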

asroy (Contributor) left a comment
Please see inline comments.

Please open a compiler JIRA for the buffer_load issue.

For future PRs, please:

  1. use a branch inside the CK repo, which makes it easier for reviewers to switch to and test
  2. run clang-format-10 before creating the PR

j4yan (Contributor, Author) commented Jan 28, 2022

(quoting @asroy's review)

Please see inline comments.

Please open a compiler JIRA for the buffer_load issue.

For future PRs, please:

  1. use a branch inside the CK repo, which makes it easier for reviewers to switch to and test
  2. run clang-format-10 before creating the PR

j4yan closed this Jan 28, 2022
j4yan reopened this Jan 28, 2022
j4yan (Contributor, Author) commented Jan 28, 2022

(quoting @asroy's review)

Please see inline comments.

Please open a compiler JIRA for the buffer_load issue.

For future PRs, please:

  1. use a branch inside the CK repo, which makes it easier for reviewers to switch to and test
  2. run clang-format-10 before creating the PR

Here's the buffer_load issue: https://ontrack-internal.amd.com/browse/SWDEV-319513

Comment thread on host/driver_offline/include/driver_batched_gemm_xdlops_v2r3.hpp (outdated)
Comment thread on host/driver_offline/src/conv3d_fwd_driver_offline.cpp (outdated)
DoMagicDivision(int32_t dividend_i32, uint32_t multiplier, uint32_t shift)
{
    uint32_t dividend_u32 = bit_cast<uint32_t>(dividend_i32);
    uint32_t tmp = static_cast<unsigned long long>(dividend_u32) * multiplier >> 32;

Contributor: uint64_t (i.e., use uint64_t here instead of unsigned long long).
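
For reference, a host-side magic division along the lines of the snippet and the uint64_t suggestion might look like this (a sketch only; whether CK's scheme uses the "add then shift" final step below is an assumption, so treat the exact formula as illustrative):

#include <cstdint>
#include <cstring>

// Host-side counterpart of the device magic division (sketch). The multiplier
// and shift are assumed to have been precomputed for the divisor elsewhere.
static uint32_t bit_cast_u32(int32_t x)
{
    uint32_t y;
    std::memcpy(&y, &x, sizeof(y));
    return y;
}

static uint32_t DoMagicDivision(int32_t dividend_i32, uint32_t multiplier, uint32_t shift)
{
    const uint32_t dividend_u32 = bit_cast_u32(dividend_i32);

    // High 32 bits of the 32x32 -> 64-bit product, using uint64_t as suggested.
    const uint32_t tmp =
        static_cast<uint32_t>((static_cast<uint64_t>(dividend_u32) * multiplier) >> 32);

    // Assumed "add then shift" variant; the exact final step depends on how the
    // magic numbers were generated.
    return (tmp + dividend_u32) >> shift;
}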

typename ConvDilations,
typename InLeftPads,
typename InRightPads>
void host_conv3d_ndhwc_kzyxc_ndhwk(const Tensor<TIn>& in,

j4yan (Contributor, Author) commented Feb 12, 2022

It looks weird to me that host::ReferenceConvFwd inherits from device::BaseOperator.

asroy (Contributor) commented Feb 12, 2022

Yes, host::ReferenceConvFwd is an afterthought. We need to reorganize the dependency. You can create an issue and assign it to me.

@@ -0,0 +1,106 @@
#ifndef NAIVE_CONV_FWD_HPP

Contributor: This needs to be wrapped inside a Device operator class.

Author: Could you elaborate on this?

Contributor: You can write a class DeviceConvolutionNaive, which will call this kernel underneath.
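
As a rough illustration of that suggestion (entirely hypothetical; the class name comes from the comment above, but the interface, argument struct, and launcher below are assumptions, not CK's actual device-op API):

// Forward declaration of a hypothetical launcher around the naive kernel; in the
// PR this role is played by launch_and_time_kernel + naive_conv3d_fwd.
float launch_naive_conv3d_fwd(const float* p_in, const float* p_wei, float* p_out, int nrepeat);

// Hypothetical device-operator-style wrapper around the naive conv3d kernel.
struct DeviceConvolutionNaive
{
    struct Argument
    {
        const float* p_in;
        const float* p_wei;
        float* p_out;
        // problem sizes, strides, dilations, and pads would also live here
    };

    struct Invoker
    {
        float Run(const Argument& arg, int nrepeat) const
        {
            return launch_naive_conv3d_fwd(arg.p_in, arg.p_wei, arg.p_out, nrepeat);
        }
    };

    static constexpr bool IsSupportedArgument(const Argument&) { return true; }
};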

index_t GemmK1Value>
__host__ __device__ constexpr auto
transform_forward_convolution3d_into_gemm_v4r4r4_nhwc_kyxc_nhwk_pad_split_batch(
// const TensorDescriptor<In...>& in_grid_desc_n_di_hi_wi_c,

Contributor: remove (the commented-out line above).

make_pass_through_transform(N),
make_pass_through_transform(K1)),
make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}, Sequence<3>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}));

Contributor: This bug is still not fixed.

make_pass_through_transform(M),
make_pass_through_transform(N)),
make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}));

Contributor: bug

typename CThreadTransferSrcDstAccessOrder,
index_t CThreadTransferSrcDstVectorDim,
index_t CThreadTransferDstScalarPerVector>
struct GridwiseBatchedGemm_bk0mk1_k0nk1_bmn_xdlops_v2r3

Contributor: The file name and the struct name are inconsistent (v2r3 vs. v2r3r3).

}

__host__ __device__ static constexpr auto
MakeAGridDescriptor_K0_M_K1(const AGridDesc_B_K0_M_K1& a_grid_desc_b_k0_m_k1, const int bb)

Contributor: int and long long are not allowed in index calculations; please use index_t and long_index_t only.
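
For reference, the rule boils down to using CK's index aliases instead of the built-in types (the alias definitions below are a sketch of what CK provides, not copied from its headers):

#include <cstdint>

// Sketch of the two aliases used for index calculations in CK.
using index_t      = int32_t;
using long_index_t = int64_t;

// The batch index parameter in the snippet above would then be declared as
//   MakeAGridDescriptor_K0_M_K1(const AGridDesc_B_K0_M_K1& desc, index_t bb)
// rather than taking a plain int (and long_index_t rather than long long for
// 64-bit offsets).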

template <long_index_t N>
using LongNumber = integral_constant<long_index_t, N>;

template <typename Index0,

Contributor suggested:

template <typename Index0,
          Index0 X,
          typename Index1,
          Index1 Y>
__host__ __device__ constexpr auto operator+(integral_constant<Index0, X>,
                                             integral_constant<Index1, Y>)
{
    constexpr auto Z = X + Y;
    return integral_constant<decltype(Z), Z>{};
}

Author: That's indeed better. I moved the operators into integral_constant.hpp, leaving number.hpp almost empty. Should we also move the Number alias along with them?
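
For reference, a self-contained illustration of how the suggested generic operator behaves (Number, LongNumber, and integral_constant here are stand-ins modeled on the snippets above, not CK's own definitions):

#include <cstdint>
#include <type_traits>

using index_t      = int32_t;
using long_index_t = int64_t;

template <typename T, T v>
using integral_constant = std::integral_constant<T, v>;

template <index_t N>
using Number = integral_constant<index_t, N>;

template <long_index_t N>
using LongNumber = integral_constant<long_index_t, N>;

// Generic operator+ over two integral_constant types, along the lines of the
// suggestion above: the result type follows the usual arithmetic promotion.
template <typename Index0, Index0 X, typename Index1, Index1 Y>
constexpr auto operator+(integral_constant<Index0, X>, integral_constant<Index1, Y>)
{
    return integral_constant<decltype(X + Y), X + Y>{};
}

// Mixing a 32-bit Number with a 64-bit LongNumber yields a 64-bit constant,
// so compile-time offset arithmetic doesn't overflow int32.
static_assert(std::is_same_v<decltype(Number<3>{} + LongNumber<5>{}), LongNumber<8>>);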

Comment thread on composable_kernel/include/utility/number.hpp (outdated)

float ave_time = launch_and_time_kernel(naive_conv3d_fwd,
nrepeat,
dim3(256),

Contributor: Any reason to use a constant grid size of 256?

Author: No specific reason. The implementation doesn't rely on the block size or grid size, so I hard-coded them.
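
For context, kernels written with a grid-stride loop are insensitive to the launch configuration; a minimal sketch of that pattern (illustrative only, not the actual naive_conv3d_fwd kernel):

#include <cstdint>

// Grid-stride loop: each thread strides over the whole problem, so the kernel
// produces correct results for any grid size; a hard-coded dim3(256) then only
// affects occupancy, not correctness.
__global__ void scale_kernel(float* out, const float* in, int64_t total_elements, float alpha)
{
    const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
    for(int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
        i < total_elements;
        i += stride)
    {
        out[i] = alpha * in[i];
    }
}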

in_left_pads[2] = std::stoi(argv[21]);
in_right_pads[0] = std::stoi(argv[22]);
in_right_pads[1] = std::stoi(argv[23]);
in_right_pads[2] = std::stoi(argv[24]);

Contributor: You can use getopt_long() to help classify and parse the 25 arguments.
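
A minimal sketch of what that could look like (hypothetical option names; the example's real argument list is positional and much longer):

#include <getopt.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[])
{
    // Hypothetical long options; the real driver takes ~25 positional arguments.
    static option long_opts[] = {{"n", required_argument, nullptr, 'n'},
                                 {"in-left-pads", required_argument, nullptr, 'l'},
                                 {"in-right-pads", required_argument, nullptr, 'r'},
                                 {nullptr, 0, nullptr, 0}};

    int n = 1;
    int opt;
    while((opt = getopt_long(argc, argv, "n:l:r:", long_opts, nullptr)) != -1)
    {
        switch(opt)
        {
        case 'n': n = std::atoi(optarg); break;
        case 'l': /* parse comma-separated left pads from optarg */ break;
        case 'r': /* parse comma-separated right pads from optarg */ break;
        default: std::fprintf(stderr, "unknown option\n"); return 1;
        }
    }
    std::printf("N = %d\n", n);
    return 0;
}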

j4yan (Contributor, Author) commented Feb 21, 2022

Good to know. But since the other examples don't use it, I'll leave it as is; we can change them all together in the future if needed.

j4yan (Contributor, Author) commented Feb 21, 2022

See #94 instead.

j4yan closed this Feb 21, 2022
carlushuang added a commit that referenced this pull request Jan 31, 2024
* Re-organize example directories

* Move reference operation into sub-folder

* Move mask types into dedicated files

* Separate utility interface & implementation

* Resume pipeline changes in fmha_fwd.cpp

* Rename folder 'fmha_fwd' to 'fmha'

* Move more function to utils.*

* Remove 'fmha_fwd_kernel_invoker.hpp'

* Re-format files

* Move Kargs types into dedicated file

* Fix formatting

* Fix compilation errors

* Avoid instantiating unused types

* Extract configurable code

* Add missing include directive

* Instantiate template functions outside fmha_fwd.cpp

* Separate implementation files

* Merge config files

* Merge duplicated code

* Remove no-longer used file

* Unify enum name

* Extract no_mask kernel

* Further separate template specializations

* Use file(GLOB) to get file list

* Include needed config file only once

* Remove debug message

* Add comment to explain template specializations

* Move impl files under 'kernels' sub-folder

* Only include *.inc in *.inc files

* Add extra type arg to control selected kernel

* Add kernel specializations for bf16

* Switch kernel according to cmdline options

* Re-order type parameters

* Reduce loop indent level

* Instantiate launch_kernel()

* Rename source files

* Remove duplicated code

* Remove more duplicated code

* Clean up code

* Rename 'FmhaMaskType' to 'FmhaMasks'

* Remove no-longer used include directive

* Move template declarations into dedicated header

* Use Python codegen

* Modify validation logic

* Format print and add smoke_test script

* Modify bf16 elimit, add benchmark script

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
carlushuang pushed a commit that referenced this pull request Apr 26, 2024
Synchronize the kernel changes used by xformers to ck_tile example/91_tile_program/fmha
hyoon1 pushed a commit to hyoon1/composable_kernel that referenced this pull request Mar 19, 2026
fix typo in function mha_fwd