
Layer Norm x86 SIMD Optimizations #4065

Merged: 24 commits into Tencent:master on Jul 29, 2022

Conversation

@LinHeLurking (Contributor)

This PR provides SIMD optimizations for LayerNorm, for both packed and unpacked tensors.
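For reference, LayerNorm normalizes each group of size elements to zero mean and unit variance, then applies the per-element affine weights. A minimal scalar sketch of the computation being vectorized (illustrative only, not the merged code):

#include <math.h>

static void layernorm_ref(float* x, const float* gamma, const float* beta, int size, float eps)
{
    // mean over the normalized group
    float mean = 0.f;
    for (int i = 0; i < size; i++)
        mean += x[i];
    mean /= size;

    // variance over the normalized group
    float var = 0.f;
    for (int i = 0; i < size; i++)
        var += (x[i] - mean) * (x[i] - mean);
    var /= size;

    // fold normalization into one multiply-add per element: y = (x * a + b) * gamma + beta
    float a = 1.f / sqrtf(var + eps);
    float b = -mean * a;
    for (int i = 0; i < size; i++)
        x[i] = (x[i] * a + b) * gamma[i] + beta[i];
}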

@tencent-adm commented Jul 21, 2022

CLA assistant check
All committers have signed the CLA.

@codecov-commenter commented Jul 21, 2022

Codecov Report

Merging #4065 (1b118f7) into master (4f414c1) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4065      +/-   ##
==========================================
+ Coverage   94.41%   94.43%   +0.02%     
==========================================
  Files         745      748       +3     
  Lines      178496   179052     +556     
==========================================
+ Hits       168533   169094     +561     
+ Misses       9963     9958       -5     
Impacted Files                      Coverage Δ
src/layer/x86/layernorm_x86.cpp     100.00% <100.00%> (ø)
src/cpu.cpp                          62.11% <0.00%> (-0.24%) ⬇️
src/mat.h                            89.82% <0.00%> (ø)
src/layer/riscv/gru_riscv.cpp        96.56% <0.00%> (ø)
src/layer/riscv/rvv_mathfun.h       100.00% <0.00%> (ø)
src/layer/riscv/cast_riscv.cpp       95.58% <0.00%> (ø)
src/layer/riscv/clip_riscv.cpp      100.00% <0.00%> (ø)
src/layer/riscv/crop_riscv.cpp       97.26% <0.00%> (ø)
src/layer/riscv/gelu_riscv.cpp      100.00% <0.00%> (ø)
src/layer/riscv/mish_riscv.cpp      100.00% <0.00%> (ø)
... and 31 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4158e63...1b118f7.

@nihui (Member) left a comment

For SIMD register horizontal sum, there is a utility function in x86_usability.h.
For AVX/FMA multiply-add intrinsics, there is a comp_fmadd wrapper function in x86_usability.h.

Use size * elempack as the loop count when applicable, so you can merge multiple for-loop code blocks into one.
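As a concrete illustration of both suggestions, here is a hedged sketch of a fused multiply-add followed by a horizontal reduction. The helper names _mm256_comp_fmadd_ps and _mm256_reduce_add_ps are assumed from src/layer/x86/x86_usability.h and should be checked against the header:

#include <immintrin.h>
#include "x86_usability.h" // assumed location of the helpers

// sum of a[i] * v[i] + b[i] over a flat float buffer
static float fmadd_sum(const float* v, const float* a, const float* b, int size)
{
    float sum = 0.f;
    int i = 0;
#if __AVX__
    __m256 _sum = _mm256_setzero_ps();
    for (; i + 7 < size; i += 8)
    {
        __m256 _v = _mm256_loadu_ps(v + i);
        __m256 _a = _mm256_loadu_ps(a + i);
        __m256 _b = _mm256_loadu_ps(b + i);
        // wrapper dispatches to _mm256_fmadd_ps when FMA is available
        _sum = _mm256_add_ps(_sum, _mm256_comp_fmadd_ps(_a, _v, _b));
    }
    // horizontal sum helper from x86_usability.h
    sum += _mm256_reduce_add_ps(_sum);
#endif
    for (; i < size; i++)
        sum += a[i] * v[i] + b[i];
    return sum;
}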

@LinHeLurking (Contributor, Author)

> For SIMD register horizontal sum, there is a utility function in x86_usability.h. For AVX/FMA multiply-add intrinsics, there is a comp_fmadd wrapper function in x86_usability.h.
>
> Use size * elempack as the loop count when applicable, so you can merge multiple for-loop code blocks into one.

I don't think I need SIMD register horizontal summation, because the length of the tensor varies. But the AVX/FMA fmadd wrappers in x86_usability.h are useful; I've adopted them.

I'm not sure, though, how to merge multiple loop blocks into one using size * elempack as the loop count. The packed layout is different from the normal one. When handling unpacked tensors I can compute along the dimensions, but when handling packed tensors, branches must be dispatched according to the elempack parameter, which is implicitly decided by CPU vector instruction set availability.

@nihui (Member) commented Jul 22, 2022

Suppose v is data from the tensor, and a is the weight (such as alpha, beta, gamma, etc.)

pack1

for loop1
{
    v1 + a1
}

pack4

for loop4
{
    v4 + a4
}

pack8

for loop8
{
    v8 + a8
}

pack16

for loop16
{
    v16 + a16
}

unified pack

// prepare a4 a8 a16 if pack1
a1 = a1
a4 = a1 a1 a1 a1
a8 = a4 a4
a16 = a8 a8

// prepare a8 a16 if pack4
a1 = undefined
a4 = a4
a8 = a4 a4
a16 = a8 a8

// prepare a16 if pack8
a1 = undefined
a4 = undefined
a8 = a8
a16 = a8 a8

// nothing extra to prepare if pack16
a1 = undefined
a4 = undefined
a8 = undefined
a16 = a16

for loop16
{
    v16 + a16
}
for loop8
{
    v8 + a8
}
for loop4
{
    v4 + a4
}
for loop1
{
    v1 + a1
}
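Read concretely: broadcast the per-row scalar coefficient once for each target width, then run a single descending loop ladder over n = size * elempack, which serves pack1/pack4/pack8/pack16 alike. A hedged C++ sketch of that ladder (illustrative names, not the code merged in this PR):

#include <immintrin.h>

static void scale_row(float* ptr, float a, int n)
{
    int i = 0;
#if __AVX512F__
    __m512 _a16 = _mm512_set1_ps(a);
    for (; i + 15 < n; i += 16)
    {
        _mm512_storeu_ps(ptr + i, _mm512_mul_ps(_mm512_loadu_ps(ptr + i), _a16));
    }
#endif
#if __AVX__
    __m256 _a8 = _mm256_set1_ps(a);
    for (; i + 7 < n; i += 8)
    {
        _mm256_storeu_ps(ptr + i, _mm256_mul_ps(_mm256_loadu_ps(ptr + i), _a8));
    }
#endif
#if __SSE2__
    __m128 _a4 = _mm_set1_ps(a);
    for (; i + 3 < n; i += 4)
    {
        _mm_storeu_ps(ptr + i, _mm_mul_ps(_mm_loadu_ps(ptr + i), _a4));
    }
#endif
    for (; i < n; i++)
        ptr[i] *= a;
}

Each narrower loop only picks up the remainder left by the wider one, so the same function is correct for any elempack.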

@LinHeLurking (Contributor, Author)

> Suppose v is data from the tensor, and a is the weight (such as alpha, beta, gamma, etc.) […]
> (unified-pack example quoted in full above)

Thanks. I've now managed to merge the many cases into one.

@nihui (Member) left a comment

add copyright header for new source

src/layer/x86/layernorm_x86.h (outdated review thread, resolved)
@nihui (Member) commented Jul 25, 2022

The diff coverage is not good enough: see https://app.codecov.io/gh/Tencent/ncnn/pull/4065. It may be a good idea to steal some test cases from #4060 and add more tests to test_layernorm.cpp.
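A hedged sketch of what such an added case could look like, following the shape of ncnn's existing layer tests (test_layer, RandomMat, ParamDict); the LayerNorm parameter ids below are assumptions and should be checked against the real test_layernorm.cpp:

#include "testutil.h"

static int test_layernorm(const ncnn::Mat& a, int affine_size, float eps, int affine)
{
    ncnn::ParamDict pd;
    pd.set(0, affine_size); // affine_size (assumed id)
    pd.set(1, eps);         // eps (assumed id)
    pd.set(2, affine);      // affine (assumed id)

    std::vector<ncnn::Mat> weights(2);
    weights[0] = RandomMat(affine_size); // gamma
    weights[1] = RandomMat(affine_size); // beta

    int ret = test_layer("LayerNorm", pd, weights, a);
    if (ret != 0)
    {
        fprintf(stderr, "test_layernorm failed a.dims=%d a=(%d %d %d)\n", a.dims, a.w, a.h, a.c);
    }
    return ret;
}

// e.g. a 16-element-wide input exercises the pack16 (AVX-512) path:
// int ret = test_layernorm(RandomMat(16, 24, 32), 16, 0.001f, 1);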

@LinHeLurking (Contributor, Author)

> The diff coverage is not good enough: see https://app.codecov.io/gh/Tencent/ncnn/pull/4065. It may be a good idea to steal some test cases from #4060 and add more tests to test_layernorm.cpp.

I've added some test cases for 16-packed tensors.

But I'm confused by the diff coverage. Most files shown at https://app.codecov.io/gh/Tencent/ncnn/pull/4065 are not modified or even influenced by this PR. I have no idea how LayerNorm_x86 affects them.

@nihui (Member) commented Jul 25, 2022

> I've added some test cases for 16-packed tensors.
>
> But I'm confused by the diff coverage. Most files shown at https://app.codecov.io/gh/Tencent/ncnn/pull/4065 are not modified or even influenced by this PR. I have no idea how LayerNorm_x86 affects them.

It often fails that way. Just pay attention to the modified files we know about.

src/layer/x86/layernorm_x86.h (outdated review thread, resolved)
src/layer/x86/layernorm_x86.cpp (outdated review thread, resolved)
src/layer/x86/layernorm_x86.cpp (outdated review thread, resolved)
@nihui merged commit 03f2ad3 into Tencent:master on Jul 29, 2022
@nihui (Member) commented Jul 29, 2022

Thanks for your contribution!
