[BugFix] dynamic cache kv block_wise_fp8 not need create layer.cache_k_scale #5362

Merged

yuanlehome merged 1 commit into PaddlePaddle:develop from yuanlehome:fix_blockwise_fp8_scale

Dec 3, 2025

fastdeploy/model_executor/layers/quantization/kv_cache.py

@@ Expand Up @@
             else:
                 raise NotImplementedError(f"{self.cache_quant_config.quant_type} is not implemented")
-            scale_shape = [layer.fd_config.model_config.num_key_value_heads]
-            if self.cache_quant_config.is_channel_wise:
-                scale_shape = [layer.kv_num_heads * layer.head_dim]
-            layer.cache_k_scale = layer.create_parameter(
-                shape=scale_shape,
-                dtype=paddle.get_default_dtype(),
-                default_initializer=paddle.nn.initializer.Constant(0),
-            )
-            layer.cache_v_scale = layer.create_parameter(
-                shape=scale_shape,
-                dtype=paddle.get_default_dtype(),
-                default_initializer=paddle.nn.initializer.Constant(0),
-            )
-            set_weight_attrs(
-                layer.cache_k_scale,
-                {
-                    **extra_weight_attrs,
-                },
-            )
-            set_weight_attrs(
-                layer.cache_v_scale,
-                {
-                    **extra_weight_attrs,
-                },
-            )
+            if "block_wise" not in layer.cache_quant_type_str:  # dynamic cache kv block_wise_fp8 not need
+                scale_shape = [layer.fd_config.model_config.num_key_value_heads]
+                if self.cache_quant_config.is_channel_wise:
+                    scale_shape = [layer.kv_num_heads * layer.head_dim]
-            layer.cache_k_out_scale = layer.create_parameter(
-                shape=scale_shape,
-                dtype=paddle.get_default_dtype(),
-                default_initializer=paddle.nn.initializer.Constant(0),
-            )
-            layer.cache_v_out_scale = layer.create_parameter(
-                shape=scale_shape,
-                dtype=paddle.get_default_dtype(),
-                default_initializer=paddle.nn.initializer.Constant(0),
-            )
-            if self.cache_quant_config.has_zero_point:
-                layer.cache_k_zp = layer.create_parameter(
+                layer.cache_k_scale = layer.create_parameter(
                     shape=scale_shape,
                     dtype=paddle.get_default_dtype(),
                     default_initializer=paddle.nn.initializer.Constant(0),
                 )
-                layer.cache_v_zp = layer.create_parameter(
+                layer.cache_v_scale = layer.create_parameter(
                     shape=scale_shape,
                     dtype=paddle.get_default_dtype(),
                     default_initializer=paddle.nn.initializer.Constant(0),
                 )
                 set_weight_attrs(
-                    layer.cache_k_zp,
+                    layer.cache_k_scale,
                     {
                         **extra_weight_attrs,
                     },
                 )
                 set_weight_attrs(
-                    layer.cache_v_zp,
+                    layer.cache_v_scale,
                     {
                         **extra_weight_attrs,
                     },
                 )
+                layer.cache_k_out_scale = layer.create_parameter(
+                    shape=scale_shape,
+                    dtype=paddle.get_default_dtype(),
+                    default_initializer=paddle.nn.initializer.Constant(0),
+                )
+                layer.cache_v_out_scale = layer.create_parameter(
+                    shape=scale_shape,
+                    dtype=paddle.get_default_dtype(),
+                    default_initializer=paddle.nn.initializer.Constant(0),
+                )
+                if self.cache_quant_config.has_zero_point:
+                    layer.cache_k_zp = layer.create_parameter(
+                        shape=scale_shape,
+                        dtype=paddle.get_default_dtype(),
+                        default_initializer=paddle.nn.initializer.Constant(0),
+                    )
+                    layer.cache_v_zp = layer.create_parameter(
+                        shape=scale_shape,
+                        dtype=paddle.get_default_dtype(),
+                        default_initializer=paddle.nn.initializer.Constant(0),
+                    )
+                    set_weight_attrs(
+                        layer.cache_k_zp,
+                        {
+                            **extra_weight_attrs,
+                        },
+                    )
+                    set_weight_attrs(
+                        layer.cache_v_zp,
+                        {
+                            **extra_weight_attrs,
+                        },
+                    )
         def process_loaded_weights(self, layer: nn.Layer, state_dict):
             """
             use for loader v0
@@ Expand Down @@
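
The effect of the new guard can be sketched without Paddle. This is a minimal illustration only, not the FastDeploy implementation: `FakeLayer`, `needs_static_kv_scales`, `create_kv_scale_params`, and the `"cache_int8"` type string are hypothetical stand-ins, and plain lists replace `layer.create_parameter`. Only the `"block_wise" not in layer.cache_quant_type_str` check, the two scale shapes, and the parameter names are taken from the diff above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLayer:
    """Hypothetical stand-in for the attention layer (not a FastDeploy class)."""
    cache_quant_type_str: str
    kv_num_heads: int = 8
    head_dim: int = 128
    params: dict = field(default_factory=dict)


def needs_static_kv_scales(cache_quant_type_str: str) -> bool:
    # Same substring check as the guard added in the diff: block-wise
    # quantization is dynamic, so no static scale parameters are created.
    return "block_wise" not in cache_quant_type_str


def create_kv_scale_params(layer, is_channel_wise=False, has_zero_point=False):
    """Create zero-filled placeholder scales unless the cache is block-wise."""
    if not needs_static_kv_scales(layer.cache_quant_type_str):
        return layer  # nothing to create for the block-wise case
    # One scale per KV head, or one per channel when channel-wise.
    size = layer.kv_num_heads
    if is_channel_wise:
        size = layer.kv_num_heads * layer.head_dim
    names = ["cache_k_scale", "cache_v_scale",
             "cache_k_out_scale", "cache_v_out_scale"]
    if has_zero_point:
        names += ["cache_k_zp", "cache_v_zp"]
    for name in names:
        # A list of zeros stands in for Constant(0) initialization.
        layer.params[name] = [0.0] * size
    return layer


block_wise = create_kv_scale_params(FakeLayer("block_wise_fp8"))
static = create_kv_scale_params(FakeLayer("cache_int8"), has_zero_point=True)
print(sorted(block_wise.params))  # no parameters for the block-wise case
print(sorted(static.params))      # scales, out-scales, and zero points
```

Under these assumptions, the block-wise layer ends up with no scale, out-scale, or zero-point entries, while any other quantization type gets the same set of parameters as before the change.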
