Large discrepancy between OpenCL output and CPU output on ARM #480

Closed
banbishan opened this issue Oct 19, 2020 · 13 comments · Fixed by #485
Labels: good first issue (Good for newcomers)

Comments

@banbishan

banbishan commented Oct 19, 2020

[image: CPU output (left) vs. output with OpenCL enabled (right)]

In the image above, the left side is the correct CPU output and the right side is the output after enabling OpenCL; there is a large discrepancy between the two. The model files are attached below. Could you please take a look? Thanks.
The platform is MT8163, cross-compiled with NDK 17.
20201013.zip

@darrenyao87
Collaborator

What we know so far is that OpenCL computes in fp16 while ARM uses fp32, so some difference is inevitable. @shaundai-tencent, please confirm whether the difference is within the expected range; meanwhile, @banbishan, please check how much it affects your results.

@banbishan
Author

banbishan commented Oct 19, 2020

> What we know so far is that OpenCL computes in fp16 while ARM uses fp32, so some difference is inevitable. @shaundai-tencent, please confirm whether the difference is within the expected range; meanwhile, @banbishan, please check how much it affects your results.

It is a detection network. Looking at the post-processed results, some pairs of detection boxes merge into one, and a small fraction of objects are missed. As it stands the output is unusable; the CPU output is usable, but too slow.

@shaundai-tencent
Collaborator

shaundai-tencent commented Oct 19, 2020

I looked into it: with OpenCL at fp16 precision, the intermediate accumulation in the global pooling layer (419) overflows. You could first check whether fp32 meets your precision and performance requirements; setting PRECISION_HIGH when configuring NetworkConfig enables high-precision mode.
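For intuition, here is a standalone sketch of that failure mode (not TNN's actual kernel; it assumes a compiler with the _Float16 extension, e.g. clang targeting ARM), showing how an fp16 accumulator loses a global-pooling sum long before fp32 does:

    #include <cstdio>

    int main() {
        // Hypothetical 112x112 global-pooling window with values near 8.0.
        const int h = 112, w = 112;
        _Float16 sum_fp16 = 0;
        float    sum_fp32 = 0.0f;
        for (int i = 0; i < h * w; ++i) {
            sum_fp16 += (_Float16)8.0f;
            sum_fp32 += 8.0f;
        }
        // The true sum is 112*112*8 = 100352, which is not even representable
        // in fp16 (max ~65504); on top of that, the fp16 sum stops growing
        // once 8.0 falls below the rounding step, so the computed average
        // ends up far from the true 8.0.
        printf("fp16 avg: %f  fp32 avg: %f\n",
               (float)sum_fp16 / (h * w), sum_fp32 / (h * w));
        return 0;
    }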

@banbishan
Author

> I looked into it: with OpenCL at fp16 precision, the intermediate accumulation in the global pooling layer (419) overflows. You could first check whether fp32 meets your precision and performance requirements; setting PRECISION_HIGH when configuring NetworkConfig enables high-precision mode.

@shaundai-tencent After adding the line config.precision = tnn::PRECISION_NORMAL; to my code, I get the following error:
[screenshot of the error]

@shaundai-tencent
Collaborator

I cannot reproduce your error; setting NORMAL works fine on my side. I tested on a Mate 30 Pro.

I will fix the potential overflow in avg pooling later; for now I suggest running at NORMAL precision.

@banbishan
Author

> I cannot reproduce your error; setting NORMAL works fine on my side. I tested on a Mate 30 Pro.
>
> I will fix the potential overflow in avg pooling later; for now I suggest running at NORMAL precision.

@shaundai-tencent Both NORMAL and HIGH raise the error above, and a Snapdragon 835 phone gives the same result. My code is below; the TNN build is freshly cloned from GitHub. The error in the screenshot is raised in SetInputMat, and commenting out the PRECISION_HIGH line makes it go away. Could you take a look at what is wrong?

    // stb_image: x = width, y = height.
    int x, y, channels_in_file = 3;
    unsigned char* im_bgr = stbi_load(imagepath, &x, &y, &channels_in_file, 0);
    void* source_pixels = (void*)im_bgr;
    // TNN Mat dims are {N, C, H, W}, so height (y) comes before width (x).
    // Note: N8UC4 describes 4-channel pixel data, while stbi_load with a
    // last argument of 0 returns the file's own channel count (likely 3).
    std::vector<int> nchw = {1, 3, y, x};
    auto input_mat = std::make_shared<TNN_NS::Mat>(TNN_NS::DEVICE_ARM, TNN_NS::N8UC4, nchw, source_pixels);

    TNN_NS::TNN tnn;
    TNN_NS::ModelConfig model_config;
    // Read the proto and model files into buffers.
    auto proto_tnn = fdLoadFile(model_param);
    auto model_tnn = fdLoadFile(model);
    model_config.model_type = TNN_NS::MODEL_TYPE_TNN;
    model_config.params = {proto_tnn, model_tnn};
    tnn.Init(model_config);

    TNN_NS::NetworkConfig config;
    if (div == "gpu")
        config.device_type = TNN_NS::DEVICE_OPENCL;
    else
        config.device_type = TNN_NS::DEVICE_ARM;
    config.precision = TNN_NS::PRECISION_HIGH;

    TNN_NS::Status error;
    TNN_NS::MatConvertParam input_cvt_param;
    input_cvt_param.scale = {0.017, 0.017, 0.017, 0.0};
    input_cvt_param.bias  = {-2.117, -2.035, -1.804, 0.0};

    auto net_instance = tnn.CreateInst(config, error);
    auto status = net_instance->SetInputMat(input_mat, input_cvt_param);

    status = net_instance->Forward();
    RETURN_ON_NEQ(status, TNN_NS::TNN_OK);
    std::shared_ptr<TNN_NS::Mat> output_mat = nullptr;
    status = net_instance->GetOutputMat(output_mat);

@shaundai-tencent
Collaborator

I have pushed a fix on the hotfix-issue#480 branch; please check whether fp16 precision works for you now. As for the error you reported, I will find an identical device and try to reproduce it.

@banbishan
Author

> I have pushed a fix on the hotfix-issue#480 branch; please check whether fp16 precision works for you now. As for the error you reported, I will find an identical device and try to reproduce it.

I have tried the e61dbae branch: with fp16 the results now match the CPU output. Thank you very much.

@shaundai-tencent
Collaborator

Great, we will merge it into master as soon as possible.

Also, I tried Snapdragon 835 and 821 devices and found the model fails to run on them: intermediate memory is stored as 2D images, and different GPUs impose different limits on image width and height (on the 835 the limit is 16384). We convert the NCHW data layout to NHC4W4, so the image height is N*H and the image width is Round(C/4)*W*4; some layers exceed the image width limit, so their memory cannot be allocated. If you run into this kind of problem, the only workaround is to adjust the NCHW values.
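To make the constraint concrete, here is a small sketch of the check this comment implies (FitsImage2D is a hypothetical helper, not a TNN API; 16384 is the Snapdragon 835 limit quoted above):

    // Does an NCHW shape fit a GPU's 2D-image limits under the NHC4W4
    // layout described above? Each image pixel packs 4 channel values.
    bool FitsImage2D(int n, int c, int h, int w, int max_image_size) {
        int image_height = n * h;              // N * H
        int image_width  = ((c + 3) / 4) * w;  // Round(C/4) * W pixels
        return image_height < max_image_size && image_width < max_image_size;
    }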

@banbishan
Author

banbishan commented Oct 20, 2020

OK, thanks. That seems to cap the channel count of intermediate layers: C <= 16384/W*4.

BTW, since the error is raised in SetInputMat: does SetInputMat already compute the memory usage of the intermediate layers? The input itself should not exceed the limit, round(3/4)*1280 = 1280 < 16384. Or is 16384 in bits, so that 1280*sizeof(float) > 16384? Is that it?

@shaundai-tencent
Collaborator

The requirement is N*H < 16384 && Round(C/4)*W < 16384, and it is independent of the data type. In your network, the concat output at layer 965 has NCHW (1, 512, 320, 240), which exceeds the limit. (The 16384 limit differs across GPU models; the Kirin 980, for example, allows 65536, so it can run this model.)
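Plugging that shape into the hypothetical FitsImage2D sketch from the earlier comment:

    FitsImage2D(1, 512, 320, 240, 16384);  // false: Round(512/4) * 240 = 30720 >= 16384
    FitsImage2D(1, 512, 320, 240, 65536);  // true with a Kirin 980-class limit of 65536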

bluaxe added a commit that referenced this issue Nov 9, 2020. Among its changes: [OPENCL][BUG] fix fp16 overflow risk in avg pooling (issue #480).
@cc0706329

> Great, we will merge it into master as soon as possible.
>
> Also, I tried Snapdragon 835 and 821 devices and found the model fails to run on them: intermediate memory is stored as 2D images, and different GPUs impose different limits on image width and height (on the 835 the limit is 16384). We convert the NCHW data layout to NHC4W4, so the image height is N*H and the image width is Round(C/4)*W*4; some layers exceed the image width limit, so their memory cannot be allocated. If you run into this kind of problem, the only workaround is to adjust the NCHW values.

Then why does fp16 not hit this problem? Aren't the width and height of the intermediate 2D images the same either way?

@shaundai-tencent
Collaborator

Both runs were fp16; this has nothing to do with the data type.
