[GPU] Add initial SDPA implementation #24466
Conversation
Force-pushed from 75c4dd0 to 280cc81
@itikhono, @sshlyapn,
do we expect that https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention, which corresponds to SDPA, will be fused into the SDPA internal operation on CPU and GPU?
@rkazants I think it makes sense, but currently not all cases are well optimized, so it sounds good as long as we keep the ability to decompose it back to the current representation for those cases.
Force-pushed from c887523 to 63fd7cc
* @brief Turning on this key disables SDPA operation decomposition and keeps SDPA operation in the graph.
* Enabling SDPA optimization may provide performance improvements and memory usage reduction.
* This key serves as a recommendation and may be ignored in known sub-optimal cases.
* @ingroup ov_runtime_ocl_gpu_prop_cpp_api
what is the default value?
@ilya-lavrenov currently it's disabled by default. However, in the final version it will depend on whether support for indirect inputs is implemented for SDPA in time.
This allows switching it on for models where indirect inputs are not required.
No critical issues found so far. Feel free to follow up on the comments in the next PR.
return params;
}

static std::unique_ptr<primitive_impl> create(const typed_program_node<scaled_dot_product_attention>& arg, const kernel_impl_params& impl_param) {
Does this impl differ in any way from the common version in the base class?
CeilDiv(target_seq_len, target_seq_len_block_size),
head_size * num_of_partitions };
dispatch_data.lws = { 1, 1, head_size };
} else if (kernel_idx == 2) {
Probably it should be kernel_idx == KernelsTypes::FINALIZATION
kernel_impl_params const& impl_param) {
auto desc = impl_param.typed_desc<scaled_dot_product_attention>();

return impl_param.get_input_layout(0);
I suppose this impl is incorrect, so maybe just throw an unimplemented exception here instead?
} else if (pattern_map.find(sdpa_with_attn_mask_m) != pattern_map.end()) {
sdpa = std::dynamic_pointer_cast<ov::op::v13::ScaledDotProductAttention>(pattern_map.at(sdpa_with_attn_mask_m).get_node_shared_ptr());
} else if (pattern_map.find(sdpa_with_attn_mask_and_scale_m) != pattern_map.end()) {
sdpa = std::dynamic_pointer_cast<ov::op::v13::ScaledDotProductAttention>(pattern_map.at(sdpa_with_attn_mask_and_scale_m).get_node_shared_ptr());
I think the code above can be replaced with m.get_match_root()
} else if (pattern_map.find(sdpa_with_attn_mask_and_scale_m) != pattern_map.end()) {
auto attn_mask = sdpa->get_input_source_output(3);
auto scale = sdpa->get_input_source_output(4);
sdpa_new = std::make_shared<op::SDPA>(input_q, input_k, input_v, attn_mask, scale, order_q, order_k, order_v, order_output, sdpa->get_causal());
We could probably have an internal SDPA op with optional inputs to simplify the converters.
JitConstants SDPAKernelOpt::GetJitConstants(const sdpa_params& params, size_t kernel_idx) const {
auto jit = SDPAKernelBase::GetJitConstants(params);

const auto softmax_acc_dt = params.inputs[0].GetDType();
Would be great to try an FP32 accumulator. If it doesn't impact perf, then we can use it to get better accuracy in some cases.
}

KernelsPriority SDPAKernelOpt::GetKernelsPriority(const Params& /*params*/) const {
return FORCE_PRIORITY_1;
Should probably be > 1, at least for platforms with DPAS.
#elif INPUT1_DIMS == 6
return INPUT1_GET_INDEX_SAFE(b, f, w, z, y, x);
#else
# error sdpa_ref.cl : Unsupported input 1 format
sdpa_opt
DO_BROADCAST_KEY_VALUE;
#endif
#if INPUT1_SIMPLE
return GET_DATA_INDEX_6D_SAFE(INPUT1, b, f, w, z, y, x);
Should we use the _SAFE version here? As I understand it, DO_BROADCAST_KEY_VALUE divides the index by the group size internally, so can we still get out-of-bounds indices somehow? Same question for the other functions.
exp_sum[seq_idx] = sub_group_reduce_add(exp_sum[seq_idx]);
}

// const SOFTMAX_ACCUMULATOR_TYPE inv_exp_sum = SOFTMAX_ACCUMULATOR_VAL_ONE / exp_sum[seq_idx];
Not needed, I think.