[Gelu x86] Finish intrinsic with elempack merged(fast version) #4144

LRY89757 · 2022-08-15T15:27:25Z

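Implemented the x86 optimization of GELU.
Only the fast GELU version is used, i.e. the approximation of the erfc-based formula.
Added tanh implementations to the SSE/AVX/AVX512 mathfun headers.
Added test samples to improve coverage.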
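
For readers unfamiliar with the "fast" variant, here is a minimal scalar sketch of the tanh-based GELU approximation that the comment refers to. The function names `fast_gelu_scalar` and `fast_gelu_inplace` are hypothetical and used only for illustration. The constants are the commonly cited ones (about sqrt(2/pi) and 0.044715) and are an assumption here, not copied from this PR; the PR's SIMD code processes several lanes at once rather than one float at a time.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Tanh-based ("fast") GELU approximation for a single value.
// 0.79788456f approximates sqrt(2/pi); 0.044715f is the usual cubic coefficient.
// Both constants are assumptions for this sketch.
static float fast_gelu_scalar(float x)
{
    const float inner = 0.79788456f * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(inner));
}

// Apply the approximation in place over a contiguous buffer.
// A SIMD version would handle 4, 8 or 16 floats per iteration instead.
static void fast_gelu_inplace(float* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        data[i] = fast_gelu_scalar(data[i]);
}
```

Calling `fast_gelu_inplace(ptr, size)` on a buffer gives a scalar result to compare a vectorized path against.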

codecov-commenter · 2022-08-15T15:49:34Z

Codecov Report

Merging #4144 (6e9cf57) into master (9f59711) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4144      +/-   ##
==========================================
+ Coverage   94.41%   94.44%   +0.03%     
==========================================
  Files         749      750       +1     
  Lines      179061   179180     +119     
==========================================
+ Hits       169053   169222     +169     
+ Misses      10008     9958      -50

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/layer/x86/avx512_mathfun.h | `100.00% <100.00%> (ø)` | |
| src/layer/x86/avx_mathfun.h | `100.00% <100.00%> (ø)` | |
| src/layer/x86/gelu_x86.cpp | `100.00% <100.00%> (ø)` | |
| src/layer/x86/sse_mathfun.h | `100.00% <100.00%> (ø)` | |
| src/layer/riscv/convolution1d_riscv.cpp | `99.00% <0.00%> (+0.24%)` | ⬆️ |
| src/layer/riscv/convolution_3x3_packn_fp16s.h | `99.48% <0.00%> (+0.51%)` | ⬆️ |
| src/layer/riscv/convolution_3x3_pack1ton_fp16s.h | `100.00% <0.00%> (+10.85%)` | ⬆️ |


LRY89757 · 2022-08-16T13:04:55Z

A question about the constant-definition macros in mathfun.h: why are they defined like this?

/* declare some AVX constants -- why can't I figure a better way to do that? */
#define _PS512_CONST(Name, Val) \
    static const ALIGN64_BEG float _ps512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}
#define _PI32_CONST512(Name, Val) \
    static const ALIGN64_BEG int _pi32_512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}
#define _PS512_CONST_TYPE(Name, Type, Val) \
    static const ALIGN64_BEG Type _ps512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}

Wouldn't it be enough to use functions such as `_mm_set1_ps` directly? Is there some deeper intent behind this?

The tanh implementation here temporarily borrows exp, which is a very naive approach; a dedicated x86 SIMD optimization is in progress [WIP] (now implemented).
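
As an illustration of what "borrowing exp" for tanh can look like, here is a minimal scalar sketch based on the identity tanh(x) = 1 - 2 / (exp(2x) + 1). The name `tanh_via_exp` is hypothetical, and this is not the code from the PR; a SIMD version would apply the same identity per lane using the vectorized exp from the mathfun headers.

```cpp
#include <cmath>

// tanh expressed through exp: tanh(x) = 1 - 2 / (exp(2x) + 1).
// With IEEE floats the limits behave well: exp overflowing to +inf
// yields 1, and exp underflowing to 0 yields -1.
static float tanh_via_exp(float x)
{
    return 1.0f - 2.0f / (std::exp(2.0f * x) + 1.0f);
}
```

This form costs one exp and one division per value. A dedicated tanh approximation can avoid part of that cost, which is presumably the motivation for the later fast version.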

LRY89757 · 2022-08-16T14:09:17Z

Implemented a fast x86 SIMD version of tanh.

src/layer/x86/gelu_x86.h

src/layer/x86/gelu_x86.cpp

nihui · 2022-09-17T04:29:55Z

> 1. A question about the constant-definition macros in mathfun.h: why are they defined like this?
>
> /* declare some AVX constants -- why can't I figure a better way to do that? */
> #define _PS512_CONST(Name, Val) \
>     static const ALIGN64_BEG float _ps512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}
> #define _PI32_CONST512(Name, Val) \
>     static const ALIGN64_BEG int _pi32_512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}
> #define _PS512_CONST_TYPE(Name, Type, Val) \
>     static const ALIGN64_BEG Type _ps512_##Name[16] ALIGN64_END = {Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val, Val}
>
> Wouldn't it be enough to use functions such as `_mm_set1_ps` directly? Is there some deeper intent behind this?
>
> 2. ~The `tanh` implementation here temporarily borrows `exp`, which is a very naive approach; a dedicated `simd x86` optimization is in progress **[WIP]**~ (now implemented)

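You would probably have to ask the original author; the current form is most likely the result of a battle of wits with the compiler.

My guess:

- A register type such as __m256 cannot be written as a global static variable, so the compiler ends up placing these values in the global static data area and loading them at run time.
- The compiler may store that data unaligned, so the author simply wrote the constants as arrays and forced their alignment, which makes the run-time loads more efficient.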
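
To make the guess above concrete, here is a small x86-only sketch that contrasts the two ways of obtaining a broadcast constant: an aligned load from a static table, and `_mm_set1_ps`. The macro `DEMO_PS_CONST` and the helper names are hypothetical; they only imitate the shape of the mathfun.h macros, using `alignas(16)` and the 128-bit SSE type so the example stays short. How a given compiler materializes `_mm_set1_ps` of a literal is not shown here and may differ between compilers.

```cpp
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics, x86 only

// Hypothetical demo macro: a 16-byte-aligned static array with the value repeated.
#define DEMO_PS_CONST(Name, Val) \
    alignas(16) static const float demo_ps_##Name[4] = {Val, Val, Val, Val}

DEMO_PS_CONST(half, 0.5f);

// Variant A: explicit aligned load from the static table.
static __m128 half_from_table()
{
    return _mm_load_ps(demo_ps_half);
}

// Variant B: let the compiler decide how to materialize the constant.
static __m128 half_from_set1()
{
    return _mm_set1_ps(0.5f);
}

// True when all four lanes compare equal.
static bool all_lanes_equal(__m128 a, __m128 b)
{
    // _mm_cmpeq_ps sets every bit of each equal lane;
    // _mm_movemask_ps collects the four sign bits into the low 4 bits.
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}

// True when the pointer satisfies the 16-byte alignment _mm_load_ps requires.
static bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Both variants produce the same register contents. The difference is only in who controls where the constant lives and how it is aligned.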

LRY89757 · 2022-09-17T06:35:00Z


Got it, thanks for the guidance. I will improve the code as soon as possible.

LRY89757 · 2022-09-17T08:16:09Z

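I have fixed the code format as the review asked, and also added a create_pipeline function to fall back to the base implementation when fast_gelu is not enabled.

A follow-up question: my x86 SIMD code uses fast_gelu for every computation, and the test functions report no errors. Does that mean we do not need erfc for inference at all and can simply use the fast_gelu version everywhere, since the two produce almost identical results?
If it is necessary, I will add a SIMD version of erfc.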
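
One way to check how close the two formulas are is to measure the gap directly. The sketch below is a scalar comparison with hypothetical names (`gelu_erfc`, `gelu_fast`, `max_abs_diff`). The constants are the commonly used ones and are assumptions, and the sampling range and step count are arbitrary choices, not the tolerances used by the project's tests.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Reference form: 0.5 * x * erfc(-x / sqrt(2)); 0.70710678f approximates 1/sqrt(2).
static float gelu_erfc(float x)
{
    return 0.5f * x * std::erfc(-0.70710678f * x);
}

// Tanh-based approximation; 0.79788456f approximates sqrt(2/pi).
static float gelu_fast(float x)
{
    return 0.5f * x * (1.0f + std::tanh(0.79788456f * (x + 0.044715f * x * x * x)));
}

// Largest absolute difference between the two forms on an even grid over [lo, hi].
static float max_abs_diff(float lo, float hi, int steps)
{
    assert(steps > 0 && hi > lo);
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i)
    {
        const float x = lo + (hi - lo) * i / steps;
        worst = std::max(worst, std::fabs(gelu_erfc(x) - gelu_fast(x)));
    }
    return worst;
}
```

Under these assumptions, `max_abs_diff(-6.0f, 6.0f, 1200)` stays below 1e-3. That fits the observation that the tests pass, but the two results are close rather than identical.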

nihui · 2022-09-17T08:58:44Z

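> I have fixed the code format as the review asked, and also added a create_pipeline function to fall back to the base implementation when fast_gelu is not enabled.
>
> A follow-up question: my x86 SIMD code uses fast_gelu for every computation, and the test functions report no errors. Does that mean we do not need erfc for inference at all and can simply use the fast_gelu version everywhere, since the two produce almost identical results? If it is necessary, I will add a SIMD version of erfc.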

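That is not necessary. Having erfc in the naive implementation is enough; it is there only as a reference.
At the moment, pnnx always sets fast_gelu=1 on the GELU layers it converts to ncnn.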

nihui · 2022-09-18T05:07:44Z

Thanks for your contribution!


LRY89757 and others added 2 commits August 15, 2022 23:21

Finish the gelu x86 intrinsics

aeff679

apply code-format changes

eb599a7

LRY89757 changed the title ~~[Gelu x86] Finish intrinsic with elempack merged(fast version)~~ [WIP][Gelu x86] Finish intrinsic with elempack merged(fast version) Aug 16, 2022

LRY89757 and others added 2 commits August 16, 2022 22:07

Finish the fast tanh x86 simd impl

4071222

apply code-format changes

a35fe86

LRY89757 closed this Aug 16, 2022

LRY89757 reopened this Aug 16, 2022

LRY89757 changed the title ~~[WIP][Gelu x86] Finish intrinsic with elempack merged(fast version)~~ [Gelu x86] Finish intrinsic with elempack merged(fast version) Aug 16, 2022

nihui requested changes Sep 17, 2022

Review comments (outdated, resolved):

src/layer/x86/gelu_x86.h (1 comment)

src/layer/x86/gelu_x86.cpp (3 comments)

LRY89757 and others added 2 commits September 17, 2022 14:41

Merge branch 'Tencent:master' into gelu

778bce7

Add the create_pipeline when non-fastgelu and improve the format

6e9cf57

LRY89757 closed this Sep 17, 2022

LRY89757 reopened this Sep 17, 2022

nihui merged commit 5eb56b2 into Tencent:master Sep 18, 2022
