Large discrepancy between OpenCL output and CPU output on ARM #480

Closed
banbishan opened this issue Oct 19, 2020 · 13 comments · Fixed by #485
Labels: good first issue (Good for newcomers)

Comments

@banbishan

banbishan commented Oct 19, 2020

[image: CPU output (left) vs. output with OpenCL enabled (right)]

In the image above, the left side is the correct CPU output and the right side is the output after enabling OpenCL; there is a large discrepancy between the two. The model files are attached below. Could you please take a look? Thanks.
The platform is MT8163, cross-compiled with NDK 17.
20201013.zip

@darrenyao87
Collaborator

What we know so far is that OpenCL computes in fp16 while ARM uses fp32, so some difference is inevitable. @shaundai-tencent, please confirm whether the difference is within the expected range; meanwhile, @banbishan, please check how much it affects your results.

@banbishan
Author

banbishan commented Oct 19, 2020

> What we know so far is that OpenCL computes in fp16 while ARM uses fp32, so some difference is inevitable. @shaundai-tencent, please confirm whether the difference is within the expected range; meanwhile, @banbishan, please check how much it affects your results.

It is a detection network. Looking at the post-processed results, some pairs of detection boxes merge into one, and a small fraction of objects are missed. As it stands the output is unusable; the CPU output is usable, but too slow.

@shaundai-tencent
Collaborator

shaundai-tencent commented Oct 19, 2020

I looked into it: with OpenCL at fp16 precision, the intermediate accumulation in the global pooling layer (419) overflows. You could first check whether fp32 meets your precision and performance requirements; setting PRECISION_HIGH when configuring NetworkConfig enables high-precision mode.
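For intuition, here is a standalone sketch of that failure mode (not TNN's actual kernel; it assumes a compiler with the _Float16 extension, e.g. clang targeting ARM), showing how an fp16 accumulator loses a global-pooling sum long before fp32 does:

    #include <cstdio>

    int main() {
        // Hypothetical 112x112 global-pooling window with values near 8.0.
        const int h = 112, w = 112;
        _Float16 sum_fp16 = 0;
        float    sum_fp32 = 0.0f;
        for (int i = 0; i < h * w; ++i) {
            sum_fp16 += (_Float16)8.0f;
            sum_fp32 += 8.0f;
        }
        // The true sum is 112*112*8 = 100352, which is not even representable
        // in fp16 (max ~65504); on top of that, the fp16 sum stops growing
        // once 8.0 falls below the rounding step, so the computed average
        // ends up far from the true 8.0.
        printf("fp16 avg: %f  fp32 avg: %f\n",
               (float)sum_fp16 / (h * w), sum_fp32 / (h * w));
        return 0;
    }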

@banbishan
Author

> I looked into it: with OpenCL at fp16 precision, the intermediate accumulation in the global pooling layer (419) overflows. You could first check whether fp32 meets your precision and performance requirements; setting PRECISION_HIGH when configuring NetworkConfig enables high-precision mode.

@shaundai-tencent After adding the line config.precision = tnn::PRECISION_NORMAL; to my code, I get the following error:
[screenshot of the error]

@shaundai-tencent
Collaborator

I cannot reproduce your error; setting NORMAL works fine on my side. I tested on a Mate 30 Pro.

I will fix the potential overflow in avg pooling later; for now I suggest running at NORMAL precision.

@banbishan
Author

> I cannot reproduce your error; setting NORMAL works fine on my side. I tested on a Mate 30 Pro.
>
> I will fix the potential overflow in avg pooling later; for now I suggest running at NORMAL precision.

@shaundai-tencent Both NORMAL and HIGH raise the error above, and a Snapdragon 835 phone gives the same result. My code is below; the TNN build is freshly cloned from GitHub. The error in the screenshot is raised in SetInputMat, and commenting out the PRECISION_HIGH line makes it go away. Could you take a look at what is wrong?

    // stb_image: x = width, y = height.
    int x, y, channels_in_file = 3;
    unsigned char* im_bgr = stbi_load(imagepath, &x, &y, &channels_in_file, 0);
    void* source_pixels = (void*)im_bgr;
    // TNN Mat dims are {N, C, H, W}, so height (y) comes before width (x).
    // Note: N8UC4 describes 4-channel pixel data, while stbi_load with a
    // last argument of 0 returns the file's own channel count (likely 3).
    std::vector<int> nchw = {1, 3, y, x};
    auto input_mat = std::make_shared<TNN_NS::Mat>(TNN_NS::DEVICE_ARM, TNN_NS::N8UC4, nchw, source_pixels);

    TNN_NS::TNN tnn;
    TNN_NS::ModelConfig model_config;
    // Read the proto and model files into buffers.
    auto proto_tnn = fdLoadFile(model_param);
    auto model_tnn = fdLoadFile(model);
    model_config.model_type = TNN_NS::MODEL_TYPE_TNN;
    model_config.params = {proto_tnn, model_tnn};
    tnn.Init(model_config);

    TNN_NS::NetworkConfig config;
    if (div == "gpu")
        config.device_type = TNN_NS::DEVICE_OPENCL;
    else
        config.device_type = TNN_NS::DEVICE_ARM;
    config.precision = TNN_NS::PRECISION_HIGH;

    TNN_NS::Status error;
    TNN_NS::MatConvertParam input_cvt_param;
    input_cvt_param.scale = {0.017, 0.017, 0.017, 0.0};
    input_cvt_param.bias  = {-2.117, -2.035, -1.804, 0.0};

    auto net_instance = tnn.CreateInst(config, error);
    auto status = net_instance->SetInputMat(input_mat, input_cvt_param);

    status = net_instance->Forward();
    RETURN_ON_NEQ(status, TNN_NS::TNN_OK);
    std::shared_ptr<TNN_NS::Mat> output_mat = nullptr;
    status = net_instance->GetOutputMat(output_mat);

@shaundai-tencent
Collaborator

I have pushed a fix on the hotfix-issue#480 branch; please check whether fp16 precision works for you now. As for the error you reported, I will find an identical device and try to reproduce it.

@banbishan
Author

> I have pushed a fix on the hotfix-issue#480 branch; please check whether fp16 precision works for you now. As for the error you reported, I will find an identical device and try to reproduce it.

I have tried the e61dbae branch: with fp16 the results now match the CPU output. Thank you very much.

@shaundai-tencent
Collaborator

Great, we will merge it into master as soon as possible.

Also, I tried Snapdragon 835 and 821 devices and found the model fails to run on them: intermediate memory is stored as 2D images, and different GPUs impose different limits on image width and height (on the 835 the limit is 16384). We convert the NCHW data layout to NHC4W4, so the image height is N*H and the image width is Round(C/4)*W*4; some layers exceed the image width limit, so their memory cannot be allocated. If you run into this kind of problem, the only workaround is to adjust the NCHW values.
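To make the constraint concrete, here is a small sketch of the check this comment implies (FitsImage2D is a hypothetical helper, not a TNN API; 16384 is the Snapdragon 835 limit quoted above):

    // Does an NCHW shape fit a GPU's 2D-image limits under the NHC4W4
    // layout described above? Each image pixel packs 4 channel values.
    bool FitsImage2D(int n, int c, int h, int w, int max_image_size) {
        int image_height = n * h;              // N * H
        int image_width  = ((c + 3) / 4) * w;  // Round(C/4) * W pixels
        return image_height < max_image_size && image_width < max_image_size;
    }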

@banbishan
Author

banbishan commented Oct 20, 2020

OK, thanks. That seems to cap the channel count of intermediate layers: C <= 16384/W*4.

BTW, since the error is raised in SetInputMat: does SetInputMat already compute the memory usage of the intermediate layers? The input itself should not exceed the limit, round(3/4)*1280 = 1280 < 16384. Or is 16384 in bits, so that 1280*sizeof(float) > 16384? Is that it?

@shaundai-tencent
Collaborator

The requirement is N*H < 16384 && Round(C/4)*W < 16384, and it is independent of the data type. In your network, the concat output at layer 965 has NCHW (1, 512, 320, 240), which exceeds the limit. (The 16384 limit differs across GPU models; the Kirin 980, for example, allows 65536, so it can run this model.)
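Plugging that shape into the hypothetical FitsImage2D sketch from the earlier comment:

    FitsImage2D(1, 512, 320, 240, 16384);  // false: Round(512/4) * 240 = 30720 >= 16384
    FitsImage2D(1, 512, 320, 240, 65536);  // true with a Kirin 980-class limit of 65536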

bluaxe added a commit that referenced this issue Nov 9, 2020. Among its changes: [OPENCL][BUG] fix fp16 overflow risk in avg pooling (issue #480).
@cc0706329

> Great, we will merge it into master as soon as possible.
>
> Also, I tried Snapdragon 835 and 821 devices and found the model fails to run on them: intermediate memory is stored as 2D images, and different GPUs impose different limits on image width and height (on the 835 the limit is 16384). We convert the NCHW data layout to NHC4W4, so the image height is N*H and the image width is Round(C/4)*W*4; some layers exceed the image width limit, so their memory cannot be allocated. If you run into this kind of problem, the only workaround is to adjust the NCHW values.

Then why does fp16 not hit this problem? Aren't the width and height of the intermediate 2D images the same either way?

@shaundai-tencent
Collaborator

Both runs were fp16; this has nothing to do with the data type.
