Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add riscv float32 gemm #4903

Merged
merged 46 commits into from Oct 20, 2023
Merged

Add riscv float32 gemm #4903

merged 46 commits into from Oct 20, 2023

Conversation

Xinyu302
Copy link
Contributor

@Xinyu302 Xinyu302 commented Aug 3, 2023

No description provided.

@tencent-adm
Copy link

tencent-adm commented Aug 3, 2023

CLA assistant check
All committers have signed the CLA.

@Xinyu302 Xinyu302 changed the title WIP: add riscv gemm [WIP]: add riscv gemm Aug 3, 2023
@Xinyu302 Xinyu302 marked this pull request as draft August 3, 2023 08:58
@nihui
Copy link
Member

nihui commented Aug 10, 2023

期待你的 riscv gemm !这个任务挺难的(

@Xinyu302
Copy link
Contributor Author

Xinyu302 commented Aug 21, 2023

仿照arm64版本的gemm,有了一个float32的基本实现,可以通过test_gemm和test_gemm1的测试。
由于riscv向量化里面没有zip,transpose*系列函数做了基本的实现。
同时riscv里面也没有vst2q,vst4q这种StoreZipFloat的指令,仅弄了一个能用的实现。
VL也没用上。

@Xinyu302 Xinyu302 marked this pull request as ready for review August 21, 2023 07:03
Copy link
Member

@nihui nihui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

很多intrinsics在新的gcc中无法编译了,这里需要兼容,具体看下ci中的错误

@codecov-commenter
Copy link

codecov-commenter commented Sep 14, 2023

Codecov Report

Merging #4903 (0f67295) into master (4b97730) will decrease coverage by 0.10%.
Report is 93 commits behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4903      +/-   ##
==========================================
- Coverage   94.90%   94.81%   -0.10%     
==========================================
  Files         779      769      -10     
  Lines      223166   239834   +16668     
==========================================
+ Hits       211795   227394   +15599     
- Misses      11371    12440    +1069     
Files Coverage Δ
src/layer/riscv/gemm_riscv.cpp 99.46% <ø> (ø)
src/layer/riscv/riscv_usability.h 100.00% <100.00%> (ø)

... and 83 files with indirect coverage changes

@nihui
Copy link
Member

nihui commented Sep 15, 2023

nT 初始化成了随机数?

Rewrite some intrinsic now performance OK
@Xinyu302
Copy link
Contributor Author

qemu的时间也太不准了啊...
用全志D1测 m=n=k,naive没有优化,tile只分块没有向量化,vectorize使用向量化,单位微秒

naive tile vectorize
32 309 229 355
128 17079 8019 5280
512 1010970 466917 211145
1024 7826209 3556337 1576699

@Xinyu302 Xinyu302 requested a review from nihui September 18, 2023 05:02
src/layer/riscv/gemm_riscv.h Outdated Show resolved Hide resolved
@@ -86,6 +86,284 @@ static inline vfloat32m8_t vle32_v_f32m8_f32m1(const float* ptr)
return vloxei32_v_f32m8(ptr, bindex, vl);
}

#define VL 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不要在h里面直接define VL 4
会污染到其他所有include这个h的代码

我看 transpose8x8_ps 调用的地方,基本都有前面的 load / 后面的 store,这么看或许根本不需要这个 transpose8x8_ps ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transpose8x8_ps和transpose4x4_ps的情况好像确实可以这样做

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个还没修,有时间了再改一下

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

要求transpose之后向量寄存器要存到的内存是连续的,这样就可以不用transpose8x8_ps里面的tmp数组了,但是看了一下gemm中现有的transpose满足这个条件的并不多。要求transpose之后向量寄存器要存到的内存是连续的,这样就可以不用transpose8x8_ps里面的tmp数组了,但是看了一下gemm中现有的transpose满足这个条件的并不多。

src/layer/riscv/gemm_riscv.cpp Show resolved Hide resolved
@nihui nihui closed this Sep 20, 2023
@nihui nihui reopened this Sep 20, 2023
@nihui
Copy link
Member

nihui commented Sep 21, 2023

如果自认为完成,请去掉标题的 WIP

@Xinyu302
Copy link
Contributor Author

Xinyu302 commented Sep 21, 2023

如果自认为完成,请去掉标题的 WIP

其实还有fp16的gemm没做,但是最近这两天又比较忙,想等到周末再看看。评估一下,如果真的做不完了就把现在的PR改成"Add riscv float32 gemm".

@Xinyu302
Copy link
Contributor Author

如果自认为完成,请去掉标题的 WIP

没有在riscv中找到与vfmlalq_laneq_low_f16类似的f16乘f16最后和f32累加的intrinsic函数。在计算前先讲f32转换成f16,如果还是需要在运算时将f16转换成f32,那么这样做能取得足够的收益吗?

@Xinyu302 Xinyu302 changed the title [WIP]: add riscv gemm Add riscv float32 gemm Sep 22, 2023
@nihui
Copy link
Member

nihui commented Sep 23, 2023

如果自认为完成,请去掉标题的 WIP

没有在riscv中找到与vfmlalq_laneq_low_f16类似的f16乘f16最后和f32累加的intrinsic函数。在计算前先讲f32转换成f16,如果还是需要在运算时将f16转换成f32,那么这样做能取得足够的收益吗?

vfwmul_vf_f32m2

可以参考 convolution_packn_fp16s.h fp16s部分写法

@Xinyu302
Copy link
Contributor Author

如果自认为完成,请去掉标题的 WIP

没有在riscv中找到与vfmlalq_laneq_low_f16类似的f16乘f16最后和f32累加的intrinsic函数。在计算前先讲f32转换成f16,如果还是需要在运算时将f16转换成f32,那么这样做能取得足够的收益吗?

vfwmul_vf_f32m2

可以参考 convolution_packn_fp16s.h fp16s部分写法

国庆之前估计没时间写了,放假的时候应该可以搞一下,完成“利用risc-v vector和zfh(fp16)扩展优化实现gemm_riscv.cpp,使用qemu测试”的目标

@nihui nihui closed this Oct 11, 2023
@nihui nihui reopened this Oct 11, 2023
@github-actions github-actions bot added the riscv label Oct 11, 2023
@nihui nihui self-requested a review October 16, 2023 03:03
@nihui nihui self-requested a review October 16, 2023 03:03
@nihui nihui merged commit b82d395 into Tencent:master Oct 20, 2023
27 checks passed
@nihui
Copy link
Member

nihui commented Oct 20, 2023

Thanks for your contribution !

@Xinyu302 Xinyu302 deleted the add-riscv-gemm branch January 20, 2024 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants