
[Feature] Adaptation of the new quantization method for mkldnn. #37422

Closed
baoachun opened this issue Nov 22, 2021 · 16 comments

@baoachun
Contributor

baoachun commented Nov 22, 2021

  1. For the output activation of ops, Slim inserts QuantizeLinear and DequantizeLinear operations, named quantize_linear and dequantize_linear respectively.

  2. For the trainable parameters of ops, Slim first quantizes the weights and saves them in low bits, and then inserts a dequantize operation before the weights enter the op for calculation.

  3. quantize_linear
    The quantization process requires two parameters, scale and zero_point, both of which are 1-dimensional tensors.

    The quantization formula is: y = saturate(round(x / scale) + zero_point). (A reference sketch of both formulas follows this specification.)
  • Attributes:

    • quant_axis: INT32, optional.
      For per-axis quantization, the axis along which quantization is applied. If this attribute is not set, per-layer quantization is used. For a convolution input [batch, channel, H, W], channel-wise quantization corresponds to quant_axis = 1.
    • bit_length: INT32, default is 8.
      The bit width of the quantized values; currently only quantization to a signed integer is supported.
  • Inputs:

    • X: FP32.
    • Scale: FP32.
      For layer-wise quantization, Scale has size 1. For axis-wise quantization, its size equals the size of the input tensor along the quant_axis dimension.
    • ZeroPoint: INT32, optional.
      Same size as Scale. The value range depends on bit_length; when bit_length is 8, the values must lie within the Int8 range.
  • Outputs:

    • Y: INT32.
      Same shape as X. The value range depends on bit_length; when bit_length is 8, the values lie within the Int8 range.
  4. dequantize_linear
    According to scale, zero_point and quant_axis, dequantizes a low-precision tensor back into a high-precision tensor.
    The dequantization formula is: y = (x - zero_point) * scale.
  • Attributes:

    • quant_axis: INT32, optional.
      For per-axis dequantization, the axis along which dequantization is applied. If this attribute is not set, per-layer dequantization is used. For a convolution input [batch, channel, H, W], channel-wise dequantization corresponds to quant_axis = 1.
    • bit_length: INT32, default is 8.
      The bit width of the quantized values; currently only quantization to a signed integer is supported.
  • Inputs:

    • X: INT32.
      The value range depends on bit_length; when bit_length is 8, the values lie within the Int8 range.
    • Scale: FP32.
      For layer-wise dequantization, Scale has size 1. For axis-wise dequantization, its size equals the size of the input tensor along the quant_axis dimension.
    • ZeroPoint: INT32, optional.
      Same size as Scale. The value range depends on bit_length; when bit_length is 8, the values must lie within the Int8 range.
  • Outputs:

    • Y: FP32.
      The shape of Y is the same as X.
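For clarity, here is a minimal C++ reference sketch of the two formulas above, assuming per-layer quantization (quant_axis unset), bit_length = 8 and the signed range [-128, 127]; the per-axis case applies the same arithmetic with one scale/zero_point per slice along quant_axis.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// quantize_linear: y = saturate(round(x / scale) + zero_point)
std::vector<int32_t> quantize_linear(const std::vector<float>& x,
                                     float scale, int32_t zero_point) {
  std::vector<int32_t> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    int32_t q = static_cast<int32_t>(std::round(x[i] / scale)) + zero_point;
    y[i] = std::min<int32_t>(std::max<int32_t>(q, -128), 127);  // saturate to Int8
  }
  return y;
}

// dequantize_linear: y = (x - zero_point) * scale
std::vector<float> dequantize_linear(const std::vector<int32_t>& x,
                                     float scale, int32_t zero_point) {
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    y[i] = static_cast<float>(x[i] - zero_point) * scale;
  }
  return y;
}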

Testing model: https://paddle-inference-dist.bj.bcebos.com/temp/quantized_mobilenetv1.tar.gz

Refer to the definitions of the quantization operators in ONNX:
QuantizeLinear
DequantizeLinear

@baoachun baoachun changed the title Adaptation of the new quantization method. Adaptation of the new quantization method for mkldnn. Nov 22, 2021
@paddle-bot-old

Hi! We've received your issue; please be patient while we respond. We will arrange technicians to answer your questions as soon as possible. Please make sure that you have posted enough information to describe your request. You may also check out the API docs, FAQ, GitHub Issues and the AI community to find an answer. Have a nice day!

@baoachun baoachun changed the title Adaptation of the new quantization method for mkldnn. [Feature] Adaptation of the new quantization method for mkldnn. Nov 22, 2021
@lidanqing-intel lidanqing-intel added this to the Q4 milestone Nov 23, 2021
@lidanqing-intel
Contributor

  1. Change save_quant_model to fit this new quant model.
  2. Then transform the Python script into a C++ pass. (A hypothetical skeleton of such a pass is sketched below.)
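For illustration only, a hypothetical skeleton of such a C++ pass could look as follows, assuming Paddle's FusePassBase / REGISTER_PASS infrastructure; the class name and pass name are invented, and the actual scale/zero-point extraction and graph relinking are omitted.

#include <string>
#include <unordered_set>

#include "paddle/fluid/framework/ir/fuse_pass_base.h"
#include "paddle/fluid/framework/ir/graph_pattern_detector.h"

namespace paddle {
namespace framework {
namespace ir {

// Hypothetical pass: collects quantize_linear / dequantize_linear ops so that
// their scale / zero_point information can be attached to the consuming ops
// before the quant/dequant nodes are erased from the graph.
class NewQuantDequantMkldnnPass : public FusePassBase {
 protected:
  void ApplyImpl(ir::Graph* graph) const override {
    FusePassBase::Init("new_quant_dequant_mkldnn_pass", graph);
    std::unordered_set<const Node*> nodes_to_remove;
    for (auto* node : graph->Nodes()) {
      if (!node->IsOp()) continue;
      const std::string& type = node->Op()->Type();
      if (type == "quantize_linear" || type == "dequantize_linear") {
        // A real implementation would read Scale / ZeroPoint from the scope,
        // store them as attributes of the consumer op, and relink inputs/outputs.
        nodes_to_remove.insert(node);
      }
    }
    GraphSafeRemoveNodes(graph, nodes_to_remove);
  }
};

}  // namespace ir
}  // namespace framework
}  // namespace paddle

REGISTER_PASS(new_quant_dequant_mkldnn_pass,
              paddle::framework::ir::NewQuantDequantMkldnnPass);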

@lidanqing-intel
Contributor

lidanqing-intel commented Dec 5, 2021

Hi @baoachun, I cannot download this model; the URL does not seem reachable from outside. Could you please try dragging the model into the comment box?

@lidanqing-intel
Contributor

lidanqing-intel commented Dec 24, 2021

quantize_linear detected.
channel-wise quantization detected.
scale_var_name: batch_norm_8.tmp_4.scale scale_value: [0.001]
channel-wise show up in dequantize_linear op

@lidanqing-intel
Contributor

lidanqing-intel commented Dec 27, 2021

+    conv_attr.set_zero_points(DNNL_ARG_SRC, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_DST, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_WEIGHTS, /* mask */ 0, {1});

@lidanqing-intel
Contributor

@wozna

@baoachun
Contributor Author

The Python->C++ conversion has moved to a new PR: #38643

@baoachun
Contributor Author

@lidanqing-intel @wozna I am wondering whether we can follow the GPU quantization passes when designing the mkldnn quantization pass scheme. The current Python script does a lot of weight processing, which is not in line with the intended role of a pass and makes the code difficult to maintain. For example, this scheme requires transferring data between multiple passes; as far as I know that is currently not supported, so I can only store the information in the graph. In addition, the pass performs many weight operations such as reshape and transpose, which will add a large maintenance burden later on. In view of this, I hope we can re-discuss and re-plan the implementation. Thanks!

@lidanqing-intel
Contributor

lidanqing-intel commented Jan 12, 2022

To be refined.
Concerns:

  1. Ask about the scale propagation issue.

  2. Zero-point issue: it is fine, dnnl::reorder supports it (see the reorder sketch at the end of this comment).

+    conv_attr.set_zero_points(DNNL_ARG_SRC, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_DST, /* mask */ 0, {1});
+    conv_attr.set_zero_points(DNNL_ARG_WEIGHTS, /* mask */ 0, {1});
  3. GRU does not support signed int8, so we need to recalculate. Is it true that the quant model input is always signed int8? Even if it is unsigned, that will be indicated through the zero point. If we get some value different from 128, that can only mean the quant model is wrong; hence we should assert that the zero point is 128 in the GRU quant model.

  4. The existing passes and their influence on the collected scales. It feels like we now have to change some passes, e.g. squash and scale. For the scale_conv fuse passes we have to recalculate the scales, so we will need int8 passes. We will have to figure out how many passes are affected by these changes.

Steps:

  1. We do this in passes: we collect the scales and zero points and store these values as attributes of the corresponding op.

  2. Second pass: remove all fake quant and fake dequant ops.

  3. Run all mkldnn int8 passes. At the end of the whole process we can propagate the scales that are propagatable: look for the typical ops (stored in propagatable_op_lists) and decide how to mark them in the graph. Then we need a dequantize op somewhere, or use force_fp32_output. The only open question is how to propagate the scales. It could be done in a similar way to quant2, but it may differ, because previously we saved scales for each tensor before doing the propagation; reshape and transpose will also be handled like this. Before, we put the input and output scales on the tensor, but now where do we put the scale? The main remaining concern is how to finish the int8 pattern, i.e. how to finalize the int8 calculation and reorder back to fp32.

    1. fp32 mkldnn passes (actually int8 mkldnn passes, which means we should already account for the scale recalculation caused by op fusion)
    2. Insert quantize/dequantize around convs (cpu_quantize_pass). Before, we got tensors with scales from quant2_mkldnn_pass; now we don't use the tensor and have to reconsider; currently the value should live in the target op's In_Scale attribute. We will still need cpu_quantize_squash_pass to safely finalize all int8 calculations.
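Regarding concern 2 above, a minimal standalone sketch (not Paddle code) of a per-tensor quantize step with a non-zero zero point via dnnl::reorder, using the oneDNN 2.x attribute API, could look as follows; the scale and zero-point values are made up.

#include <cstdint>
#include <vector>

#include "oneapi/dnnl/dnnl.hpp"

int main() {
  using namespace dnnl;
  engine eng(engine::kind::cpu, 0);
  stream strm(eng);

  // fp32 source and u8 destination of the same shape.
  memory::desc src_md({1, 8}, memory::data_type::f32, memory::format_tag::nc);
  memory::desc dst_md({1, 8}, memory::data_type::u8, memory::format_tag::nc);

  std::vector<float> src_data{-1.f, -0.5f, 0.f, 0.25f, 0.5f, 1.f, 2.f, 4.f};
  memory src_m(src_md, eng, src_data.data());
  memory dst_m(dst_md, eng);

  // Asymmetric quantization y ~= round(x / scale) + zero_point, expressed as an
  // output scale of 1/scale plus a destination zero point.
  const float scale = 0.05f;
  const int32_t zero_point = 128;

  primitive_attr attr;
  attr.set_output_scales(/* mask */ 0, {1.f / scale});
  attr.set_zero_points(DNNL_ARG_DST, /* mask */ 0, {zero_point});

  auto r_pd = reorder::primitive_desc(eng, src_md, eng, dst_md, attr);
  reorder(r_pd).execute(strm, src_m, dst_m);
  strm.wait();
  return 0;
}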

@lidanqing-intel
Copy link
Contributor

lidanqing-intel commented Jan 19, 2022

Confirmed: oneDNN does support asymmetric quantization. https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html
Now adding zero points to the Paddle conv mkldnn op.
https://github.com/oneapi-src/oneDNN/blob/master/tests/benchdnn/doc/knobs_attr.md
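As a sketch of that direction (illustrative only, not the actual Paddle integration), an int8 convolution primitive descriptor with source/destination zero points can be set up with the oneDNN 2.x API roughly like this; shapes, scale and zero-point values are invented, and primitive creation may fail on builds/CPUs without int8 zero-point support.

#include <vector>

#include "oneapi/dnnl/dnnl.hpp"

int main() {
  using namespace dnnl;
  engine eng(engine::kind::cpu, 0);

  // u8 activations, s8 weights, u8 output: a typical int8 inference setup.
  memory::desc src_md({1, 3, 224, 224}, memory::data_type::u8, memory::format_tag::any);
  memory::desc wei_md({64, 3, 7, 7}, memory::data_type::s8, memory::format_tag::any);
  memory::desc dst_md({1, 64, 112, 112}, memory::data_type::u8, memory::format_tag::any);

  primitive_attr conv_attr;
  conv_attr.set_output_scales(/* mask */ 0, {0.05f});
  // Asymmetric quantization: per-tensor zero points for source and destination.
  // Weights are assumed to be quantized symmetrically (zero point 0).
  conv_attr.set_zero_points(DNNL_ARG_SRC, /* mask */ 0, {128});
  conv_attr.set_zero_points(DNNL_ARG_DST, /* mask */ 0, {128});

  convolution_forward::desc conv_desc(
      prop_kind::forward_inference, algorithm::convolution_direct,
      src_md, wei_md, dst_md,
      /* strides */ {2, 2}, /* padding_l */ {3, 3}, /* padding_r */ {3, 3});
  auto conv_pd = convolution_forward::primitive_desc(conv_desc, conv_attr, eng);
  (void)conv_pd;  // memory preparation and execution omitted for brevity
  return 0;
}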

We should just reuse the GPU passes:

const std::vector<std::string> kTRTSubgraphPasses({
  "conv_affine_channel_fuse_pass",  //
      "adaptive_pool2d_convert_global_pass",
      "conv_eltwiseadd_affine_channel_fuse_pass",  //
      "shuffle_channel_detect_pass",               //
      "quant_conv2d_dequant_fuse_pass",            //
      "delete_quant_dequant_op_pass",              //
      "delete_quant_dequant_filter_op_pass",       //
      // "fc_fuse_pass",                                 //
      "simplify_with_basic_ops_pass",           //
      "embedding_eltwise_layernorm_fuse_pass",  //
      "multihead_matmul_fuse_pass_v2",          //
      "multihead_matmul_fuse_pass_v3",          //
      "skip_layernorm_fuse_pass",               //
      "conv_bn_fuse_pass",                      //
      "unsqueeze2_eltwise_fuse_pass",           //
      "squeeze2_matmul_fuse_pass",              //
      "reshape2_matmul_fuse_pass",              //
      "flatten2_matmul_fuse_pass",              //
      "map_matmul_v2_to_mul_pass",              //
      "map_matmul_v2_to_matmul_pass",           //
      "map_matmul_to_mul_pass",                 //
      "fc_fuse_pass",                           //
      "conv_elementwise_add_fuse_pass",         //
      "add_support_int8_pass",
      "tensorrt_subgraph_pass",  //
      "conv_bn_fuse_pass",       //
#if CUDNN_VERSION >= 7100  // To run conv_fusion, the version of cudnn must be
                           // guaranteed at least v7
// cudnn8.0 has memory leak problem in conv + eltwise + act, so we
// disable the pass.
#if !(CUDNN_VERSION >= 8000 && CUDNN_VERSION < 8100)
      "conv_elementwise_add_act_fuse_pass",   //
      "conv_elementwise_add2_act_fuse_pass",  //
#endif
#endif
      "transpose_flatten_concat_fuse_pass",
});

@yaomichael

Notes from the 5/20 meeting:
@wozna is taking over further optimization from @lidanqing-intel and expects to deliver in early Q3.

@wozna wozna removed their assignment Aug 2, 2022
@wozna
Contributor

wozna commented Sep 2, 2022

Hi @baoachun @yeliang2258, I'm working on this issue, continuing the changes made previously by Danqing in #42106, where there is still an accuracy problem. From what I can see, one-scale gathering is missing there, and that's probably why it doesn't work properly. WIP.
I'm also reviewing #45416 (review), where quantize_linear and dequantize_linear are also used and it is called ONNX format. An additional element there is ScaleFilePath. Do you also know if there are any differences?

@yeliang2258
Contributor

@wozna Hi, ScaleFilePath stores the scale information of all tensors in the quantized model. This file was added recently. Maybe the missing-scale problem you mentioned can be solved by loading this file.

@wozna
Contributor

wozna commented Sep 30, 2022

This new quantization method was done by @yeliang2258 in #45416.
There are two PRs also connected to this: #45920 and #46378.

ZeroPoint is not done yet. oneDNN has this option, so it is possible to add it. @yeliang2258, do you know if any of the new models use this ZeroPoint value for data shift?

@yaomichael

@wozna I consulted @yeliang2258 and he confirmed that PaddleSlim (the team producing the quant models) supports only symmetric quantization, so all models should have a zero point of 0.

@paddle-bot paddle-bot bot added the status/close 已关闭 label Jan 11, 2023