Add riscv float32 gemm #4903

Xinyu302 · 2023-08-03T02:56:59Z

No description provided.

tencent-adm · 2023-08-03T02:57:16Z

All committers have signed the CLA.

nihui · 2023-08-10T02:44:14Z

Looking forward to your riscv gemm! This task is quite hard :(

Xinyu302 · 2023-08-21T06:50:39Z

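Following the arm64 version of gemm, there is now a basic float32 implementation that passes the test_gemm and test_gemm1 tests.
Since RISC-V vectorization has no zip, the transpose* family of functions only has a basic implementation.
RISC-V also has no StoreZipFloat-style instructions such as vst2q and vst4q, so I only put together an implementation that works.
VL is not used yet either.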
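
For readers unfamiliar with the NEON instructions mentioned above, here is a minimal scalar sketch of what a vst2q-style "store zip" does: two registers are written to memory with their lanes interleaved. The function name `store_zip2_ref` and the `lanes` parameter are assumptions for illustration only; this is not the code from this PR.

```cpp
#include <cstddef>

// Scalar reference for a two-way interleaved ("zip") store.
// a and b stand in for two vector registers holding `lanes` floats each.
// After the call, dst contains a0 b0 a1 b1 ... (2 * lanes floats in total).
// Hypothetical helper for illustration, not taken from the PR.
static inline void store_zip2_ref(float* dst, const float* a, const float* b, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
    {
        dst[2 * i] = a[i];     // even slots come from the first register
        dst[2 * i + 1] = b[i]; // odd slots come from the second register
    }
}
```

A vectorized version has to reproduce exactly this memory layout, which is why a target without a direct equivalent needs some workaround.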

nihui

Many of the intrinsics no longer compile with newer gcc. This needs to be made compatible; see the errors in CI for details.

codecov-commenter · 2023-09-14T02:56:39Z

Codecov Report

Merging #4903 (0f67295) into master (4b97730) will decrease coverage by 0.10%.
Report is 93 commits behind head on master.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4903      +/-   ##
==========================================
- Coverage   94.90%   94.81%   -0.10%     
==========================================
  Files         779      769      -10     
  Lines      223166   239834   +16668     
==========================================
+ Hits       211795   227394   +15599     
- Misses      11371    12440    +1069

| Files | Coverage Δ |
|---|---|
| src/layer/riscv/gemm_riscv.cpp | `99.46% <ø> (ø)` |
| src/layer/riscv/riscv_usability.h | `100.00% <100.00%> (ø)` |

... and 83 files with indirect coverage changes

nihui · 2023-09-15T03:08:39Z

Is nT being initialized to a random value?

Rewrite some intrinsic now performance OK

Xinyu302 · 2023-09-17T19:33:24Z

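The timings from qemu are far too inaccurate...
Measured on an Allwinner D1 with m=n=k. "naive" has no optimization, "tile" uses tiling only without vectorization, and "vectorize" uses vectorization. Times are in microseconds.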

| m=n=k | naive | tile | vectorize |
|---|---|---|---|
| 32 | 309 | 229 | 355 |
| 128 | 17079 | 8019 | 5280 |
| 512 | 1010970 | 466917 | 211145 |
| 1024 | 7826209 | 3556337 | 1576699 |
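
To put these timings in perspective, the sketch below converts them into speedups and approximate throughput. It assumes the usual count of 2·n³ floating-point operations for a square GEMM with m=n=k; the function names are illustrative and not part of the PR.

```cpp
#include <cstdio>

// Speedup of an optimized variant over a baseline, both given in microseconds.
static double speedup(double baseline_us, double optimized_us)
{
    return baseline_us / optimized_us;
}

// Approximate throughput in MFLOP/s for an n x n x n GEMM.
// Assumes 2 * n^3 floating-point operations (one multiply and one add per term);
// operations per microsecond is numerically the same as MFLOP/s.
static double mflops(double n, double time_us)
{
    return 2.0 * n * n * n / time_us;
}

// Prints one summary line for a row of the timing table.
static void print_row(int n, double naive_us, double tile_us, double vectorize_us)
{
    printf("n=%d tile speedup=%.2f vectorize speedup=%.2f vectorize MFLOP/s=%.0f\n",
           n,
           speedup(naive_us, tile_us),
           speedup(naive_us, vectorize_us),
           mflops(n, vectorize_us));
}
```

Calling `print_row(1024, 7826209, 3556337, 1576699)` shows the vectorized path running a little under 5x faster than naive at the largest size, while `speedup(309, 355)` is below 1, meaning the vectorized path is slower than naive at n=32.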

src/layer/riscv/gemm_riscv.h

nihui · 2023-09-20T02:47:16Z

src/layer/riscv/riscv_usability.h

@@ -86,6 +86,284 @@ static inline vfloat32m8_t vle32_v_f32m8_f32m1(const float* ptr)
    return vloxei32_v_f32m8(ptr, bindex, vl);
 }

+#define VL 4


Don't `#define VL 4` directly in a header.
It pollutes all other code that includes this header.
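
One way to avoid a header-wide macro is to pass the lane count as an ordinary parameter, or keep it in a function-local constant. The sketch below is plain C++ with hypothetical names (`add_tile`, `vl`); it only illustrates the scoping point and is not the fix applied in this PR.

```cpp
#include <cstddef>

// A `#define VL 4` in a shared header textually rewrites every later use of the
// identifier VL in any file that includes it. Here the lane count is a normal
// function parameter instead, so it follows ordinary C++ scoping rules.
// Hypothetical helper for illustration only.
static inline void add_tile(float* dst, const float* src, size_t n, size_t vl)
{
    size_t i = 0;
    // full groups of vl elements (each group stands in for one vector operation)
    for (; i + vl <= n; i += vl)
    {
        for (size_t j = 0; j < vl; j++)
            dst[i + j] += src[i + j];
    }
    // remaining tail elements
    for (; i < n; i++)
        dst[i] += src[i];
}

// An unrelated identifier that happens to be called VL still compiles,
// because no macro with that name is defined.
static const int VL = 8;
```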

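Looking at the call sites of transpose8x8_ps, nearly all of them have a load before and a store after. Given that, perhaps transpose8x8_ps isn't needed at all?

For transpose8x8_ps and transpose4x4_ps, that does indeed seem doable.

This isn't fixed yet; I'll change it when I have time.

This requires that the memory the vector registers are stored to after the transpose be contiguous, in which case the tmp array inside transpose8x8_ps is no longer needed. Looking through gemm, though, not many of the existing transposes satisfy this condition.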
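
The idea discussed here, folding the transpose into the surrounding load or store, can be sketched in scalar form. Both functions below write the transpose of a 4x4 row-major tile to contiguous memory: the first goes through a temporary array, the second reads with a stride and needs no temporary. The names and the fixed 4x4 size are assumptions for illustration, not the PR's code.

```cpp
#include <cstddef>

// Variant 1: read the rows, transpose through a temporary, then store contiguously.
static void transpose4x4_via_tmp(const float* src, size_t src_stride, float* dst)
{
    float tmp[16];
    for (int r = 0; r < 4; r++)
    {
        for (int c = 0; c < 4; c++)
            tmp[c * 4 + r] = src[r * src_stride + c];
    }
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

// Variant 2: gather each source column with a strided read and write it out
// directly as one contiguous output row, so no temporary array is needed.
static void transpose4x4_strided(const float* src, size_t src_stride, float* dst)
{
    for (int c = 0; c < 4; c++) // one output row per source column
    {
        for (int r = 0; r < 4; r++) // walk down the column, stepping by src_stride
            dst[c * 4 + r] = src[r * src_stride + c];
    }
}
```

The second variant only works because the destination is contiguous, which matches the condition stated above. Whether a strided access actually beats an in-register transpose depends on the hardware.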

src/layer/riscv/gemm_riscv.cpp

nihui · 2023-09-21T06:06:23Z

If you consider this complete, please remove the WIP from the title.

Xinyu302 · 2023-09-21T12:17:13Z

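> If you consider this complete, please remove the WIP from the title.

Actually the fp16 gemm is still not done, but I've been fairly busy these past couple of days and would like to look at it over the weekend. I'll evaluate it, and if it really can't be finished I'll rename the current PR to "Add riscv float32 gemm".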

Xinyu302 · 2023-09-22T11:32:16Z

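> If you consider this complete, please remove the WIP from the title.

I couldn't find an intrinsic in riscv similar to vfmlalq_laneq_low_f16, which multiplies f16 by f16 and accumulates the result into f32. If f32 is first converted to f16 before the computation, and the f16 values still have to be converted back to f32 during the computation, does this yield enough of a benefit?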

nihui · 2023-09-23T03:58:14Z

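> > If you consider this complete, please remove the WIP from the title.
>
> I couldn't find an intrinsic in riscv similar to vfmlalq_laneq_low_f16, which multiplies f16 by f16 and accumulates the result into f32. If f32 is first converted to f16 before the computation, and the f16 values still have to be converted back to f32 during the computation, does this yield enough of a benefit?

vfwmul_vf_f32m2

You can refer to how the fp16s part of convolution_packn_fp16s.h is written.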
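
As a scalar illustration of the widening approach suggested here: each fp16 operand is widened to fp32 at the point of multiplication and the product is accumulated in fp32, so no separate conversion pass over the data is needed. The sketch works on raw IEEE 754 binary16 bit patterns stored in `uint16_t` and uses hypothetical helper names; it models the arithmetic only and is not the RVV intrinsic code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Convert an IEEE 754 binary16 bit pattern to float (handles zero, subnormals,
// normals, infinities and NaN). Hypothetical helper for illustration.
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp = (h >> 10) & 0x1f;
    uint32_t man = h & 0x3ff;
    uint32_t bits;
    if (exp == 0)
    {
        if (man == 0)
        {
            bits = sign; // signed zero
        }
        else
        {
            // subnormal: shift the mantissa up until the implicit bit appears
            int e = -1;
            do
            {
                man <<= 1;
                e++;
            } while ((man & 0x400) == 0);
            man &= 0x3ff;
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | (man << 13);
        }
    }
    else if (exp == 31)
    {
        bits = sign | 0x7f800000u | (man << 13); // infinity or NaN
    }
    else
    {
        bits = sign | ((exp + 127 - 15) << 23) | (man << 13); // normal number
    }
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

// Dot product with fp16 inputs and an fp32 accumulator: operands are widened
// one by one, multiplied in fp32, and summed in fp32.
static float dot_f16_widen(const uint16_t* a, const uint16_t* b, size_t n)
{
    float acc = 0.f;
    for (size_t i = 0; i < n; i++)
        acc += half_to_float(a[i]) * half_to_float(b[i]);
    return acc;
}
```

Because a binary16 significand has 11 bits, the product of two fp16 values is exactly representable in fp32, so rounding only occurs in the accumulation.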

Xinyu302 · 2023-09-26T06:34:50Z

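> > > If you consider this complete, please remove the WIP from the title.
> >
> > I couldn't find an intrinsic in riscv similar to vfmlalq_laneq_low_f16, which multiplies f16 by f16 and accumulates the result into f32. If f32 is first converted to f16 before the computation, and the f16 values still have to be converted back to f32 during the computation, does this yield enough of a benefit?
>
> vfwmul_vf_f32m2
>
> You can refer to how the fp16s part of convolution_packn_fp16s.h is written.

I probably won't have time to write this before the National Day holiday. I should be able to work on it during the break and complete the goal of "optimizing gemm_riscv.cpp with the RISC-V vector and zfh (fp16) extensions, tested with qemu".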

nihui · 2023-10-20T07:43:42Z

Thanks for your contribution!

add defination of gemm_riscv

c4be138

Xinyu302 changed the title ~~WIP: add riscv gemm~~ [WIP]: add riscv gemm Aug 3, 2023

Xinyu302 marked this pull request as draft August 3, 2023 08:58

Xinyu302 added 6 commits August 19, 2023 07:29

add pack_A_tile pack_B_tile in gemm_riscv

8d0b0c7

transpose and gemm

d8255f9

compile right, before add pipeine

523a548

finish gemm_riscv but has bug

b674cf1

add create_pipeline function in gemm_riscv

8cb184f

fix bug in transpose_unpack_output_tile

fcb974f

Xinyu302 marked this pull request as ready for review August 21, 2023 07:03

nihui requested changes Sep 13, 2023

Xinyu302 added 7 commits September 13, 2023 11:58

add #if __riscv_vector to support device which cannot run RISCV-V

28dcb7e

add C906 macro, in other case, now use naive implementation

2a32961

modify transpose kernel

45aaf1f

change C906 macro location

f98d0e7

modify store_float32_v2, store_float_v4

c373531

delete useless functions

a5c2a90

delete annotations

9424f85

Xinyu302 added 3 commits September 14, 2023 03:29

Add #include cpu.h

71f382b

add static for pack_A_tile

b0022d6

delete annotation

366d6cb

Xinyu302 added 4 commits September 15, 2023 10:09

replace vlseg2e32_v_f32m1x2 with vlseg2e32_v_f32m1

5d6a55f

fix small bugs

fca6d4b

remove C906 macro

31db10d

add nT = 0

c4a4580

Merge pull request #1 from Xinyu302/gemm-time-test

9276321

Rewrite some intrinsic now performance OK

delete useless examples

02feba4

Xinyu302 requested a review from nihui September 18, 2023 05:02

delte useless function in riscv_usability.h

eebe280

nihui reviewed Sep 20, 2023

Xinyu302 and others added 4 commits September 20, 2023 03:48

LAYER_GEMM_RISCV_H

578a8e8

delete define VL

c48b8c4

add annotation

c9dc401

apply code-format changes

6ccb081

nihui closed this Sep 20, 2023

nihui reopened this Sep 20, 2023

Xinyu302 changed the title ~~[WIP]: add riscv gemm~~ Add riscv float32 gemm Sep 22, 2023

delete riscv_zfh comment

0f67295

nihui closed this Oct 11, 2023

nihui reopened this Oct 11, 2023

github-actions bot added the riscv label Oct 11, 2023

nihui approved these changes Oct 16, 2023

nihui self-requested a review October 16, 2023 03:03

nihui merged commit b82d395 into Tencent:master Oct 20, 2023
27 checks passed

Porkepix mentioned this pull request Oct 27, 2023

ncnn 20231027 Homebrew/homebrew-core#152559

Merged

Xinyu302 deleted the add-riscv-gemm branch January 20, 2024 13:16
