
add vector optimization for loongarch64 #4242

Merged · 22 commits · Nov 11, 2022

Conversation

junchao-loongson
Contributor

#### lsx on

loop_count = 10
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet  min =   30.11  max =   30.25  avg =   30.17
     squeezenet_int8  min =   36.20  max =   36.77  avg =   36.39
           mobilenet  min =   54.16  max =   55.30  avg =   54.94
      mobilenet_int8  min =   73.63  max =   84.76  avg =   75.02
        mobilenet_v2  min =   29.86  max =   30.04  avg =   29.93
        mobilenet_v3  min =   29.87  max =   30.27  avg =   29.96
          shufflenet  min =   14.28  max =   14.43  avg =   14.33
       shufflenet_v2  min =   14.69  max =   15.15  avg =   14.78
             mnasnet  min =   33.19  max =   36.34  avg =   33.63
     proxylessnasnet  min =   41.12  max =   41.42  avg =   41.32
     efficientnet_b0  min =   56.44  max =   56.90  avg =   56.57
   efficientnetv2_b0  min =   57.82  max =   70.10  avg =   59.27
        regnety_400m  min =   42.30  max =   43.77  avg =   42.66
           blazeface  min =    3.90  max =    3.97  avg =    3.92
           googlenet  min =   96.57  max =   97.67  avg =   97.24
      googlenet_int8  min =  114.78  max =  125.97  avg =  116.31
            resnet18  min =   76.17  max =   81.27  avg =   77.84
       resnet18_int8  min =   90.56  max =  105.29  avg =   92.35
             alexnet  min =   64.05  max =   72.82  avg =   65.37
               vgg16  min =  485.09  max =  492.48  avg =  489.88
          vgg16_int8  min =  526.74  max =  551.45  avg =  533.22
            resnet50  min =  245.49  max =  278.99  avg =  257.55
       resnet50_int8  min =  295.99  max =  327.18  avg =  310.22
      squeezenet_ssd  min =   61.70  max =   62.46  avg =   62.09
 squeezenet_ssd_int8  min =   77.95  max =   89.45  avg =   79.49
       mobilenet_ssd  min =  111.67  max =  114.42  avg =  112.48
  mobilenet_ssd_int8  min =  146.82  max =  170.33  avg =  150.07
      mobilenet_yolo  min =  285.39  max =  299.10  avg =  288.71
  mobilenetv2_yolov3  min =  119.12  max =  131.47  avg =  120.98
         yolov4-tiny  min =  153.83  max =  174.90  avg =  160.86
           nanodet_m  min =   36.52  max =   75.02  avg =   40.53
    yolo-fastest-1.1  min =   16.16  max =   19.05  avg =   16.49
      yolo-fastestv2  min =   14.61  max =   14.87  avg =   14.76
  vision_transformer  min = 1652.83  max = 1672.56  avg = 1659.27
          FastestDet  min =   18.22  max =   21.26  avg =   18.58



#### lsx off

loop_count = 10
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
          squeezenet  min =   30.01  max =   30.56  avg =   30.11
     squeezenet_int8  min =   42.79  max =   57.73  avg =   45.00
           mobilenet  min =   55.12  max =   55.93  avg =   55.66
      mobilenet_int8  min =   89.75  max =   92.61  avg =   90.19
        mobilenet_v2  min =   30.62  max =   33.51  avg =   31.00
        mobilenet_v3  min =   31.04  max =   31.30  avg =   31.18
          shufflenet  min =   14.54  max =   14.72  avg =   14.61
       shufflenet_v2  min =   14.88  max =   15.47  avg =   15.01
             mnasnet  min =   33.66  max =   33.91  avg =   33.78
     proxylessnasnet  min =   41.61  max =   51.03  avg =   42.71
     efficientnet_b0  min =   58.00  max =   69.22  avg =   59.28
   efficientnetv2_b0  min =   59.64  max =   59.99  avg =   59.81
        regnety_400m  min =   42.73  max =   43.06  avg =   42.92
           blazeface  min =    3.93  max =    4.01  avg =    3.95
           googlenet  min =   97.71  max =  101.34  avg =   98.36
      googlenet_int8  min =  138.93  max =  182.00  avg =  145.42
            resnet18  min =   81.82  max =   84.90  avg =   82.62
       resnet18_int8  min =  108.41  max =  118.75  avg =  109.66
             alexnet  min =   67.48  max =   71.12  avg =   68.89
               vgg16  min =  480.30  max =  503.93  avg =  488.02
          vgg16_int8  min =  609.66  max =  628.12  avg =  620.36
            resnet50  min =  245.73  max =  250.32  avg =  247.84
       resnet50_int8  min =  357.05  max =  389.35  avg =  362.26
      squeezenet_ssd  min =   62.36  max =   73.35  avg =   63.66
 squeezenet_ssd_int8  min =   86.24  max =   89.02  avg =   87.41
       mobilenet_ssd  min =  111.42  max =  119.37  avg =  112.79
  mobilenet_ssd_int8  min =  178.72  max =  188.27  avg =  180.56
      mobilenet_yolo  min =  297.02  max =  305.32  avg =  299.69
  mobilenetv2_yolov3  min =  122.81  max =  124.30  avg =  123.32
         yolov4-tiny  min =  150.51  max =  156.30  avg =  152.37
           nanodet_m  min =   36.21  max =   36.57  avg =   36.34
    yolo-fastest-1.1  min =   15.78  max =   27.00  avg =   16.96
      yolo-fastestv2  min =   14.69  max =   14.95  avg =   14.81
  vision_transformer  min = 6423.16  max = 6451.22  avg = 6435.01
          FastestDet  min =   18.16  max =   18.46  avg =   18.32

Building with -DNCNN_BUILD_TESTS=ON and running the tests found no errors.

@tencent-adm

tencent-adm commented Oct 8, 2022

CLA assistant check
All committers have signed the CLA.

@tpoisonooo
Contributor

tql (impressive!), but no CI has been added yet.

@nihui
Member

nihui commented Oct 8, 2022

Posting some 3A4000 numbers for reference; it looks like the 3A5000 is actually slower.

https://github.com/Tencent/ncnn/blob/master/benchmark/README.md#loongson-3a4000-gs464v-18ghz--4-with-msa128

root@3A4K:~/Desktop/ncnn-20220420/ncnn-20220420/build/benchmark$ ./benchncnn 
loop_count = 4
num_threads = 4
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =   18.31  max =   18.97  avg =   18.64
     squeezenet_int8  min =   22.11  max =   35.58  avg =   25.60
           mobilenet  min =   28.07  max =   29.68  avg =   28.64
      mobilenet_int8  min =   34.10  max =  110.13  avg =   57.77
        mobilenet_v2  min =   20.73  max =   21.48  avg =   21.09
        mobilenet_v3  min =   19.92  max =   20.11  avg =   20.02
          shufflenet  min =   13.25  max =   13.98  avg =   13.51
       shufflenet_v2  min =   12.67  max =   12.95  avg =   12.87
             mnasnet  min =   20.04  max =   20.63  avg =   20.37
     proxylessnasnet  min =   23.90  max =   24.62  avg =   24.25
     efficientnet_b0  min =   38.09  max =   56.57  avg =   43.08
   efficientnetv2_b0  min =   41.14  max =   41.82  avg =   41.36
        regnety_400m  min =   36.19  max =   37.52  avg =   36.79
           blazeface  min =    4.05  max =    4.51  avg =    4.24
           googlenet  min =   74.61  max =   87.59  avg =   78.16
      googlenet_int8  min =   85.53  max =   87.06  avg =   86.27
            resnet18  min =   64.90  max =   71.13  avg =   67.04
       resnet18_int8  min =   60.56  max =   72.30  avg =   63.62
             alexnet  min =   74.92  max =   80.70  avg =   76.49
               vgg16  min =  335.14  max =  349.20  avg =  340.92
          vgg16_int8  min =  299.33  max =  371.58  avg =  318.36
            resnet50  min =  148.97  max =  240.90  avg =  176.92
       resnet50_int8  min =  161.41  max =  256.27  avg =  186.67
      squeezenet_ssd  min =   59.74  max =   60.25  avg =   59.92
 squeezenet_ssd_int8  min =   59.38  max =  140.09  avg =   79.84
       mobilenet_ssd  min =   59.61  max =   61.20  avg =   60.63
  mobilenet_ssd_int8  min =   71.35  max =  171.46  avg =  108.97
      mobilenet_yolo  min =  176.17  max =  262.16  avg =  201.31
  mobilenetv2_yolov3  min =   79.15  max =   87.97  avg =   81.50
         yolov4-tiny  min =  113.99  max =  117.35  avg =  115.06
           nanodet_m  min =   26.27  max =   27.11  avg =   26.63
    yolo-fastest-1.1  min =   11.65  max =  117.04  avg =   38.22
      yolo-fastestv2  min =   12.03  max =   12.40  avg =   12.16

@junchao-loongson
Contributor Author

The tests passed earlier only because I had accidentally replaced a macro definition. After reverting it, quite a few problems surfaced. Give me some time to fix them.

@junchao-loongson
Contributor Author

```shell
# cat do_test.sh
for ncnn_test in `ls test_*`
do
    echo "----- "$ncnn_test
    if test $ncnn_test != "test_reduction"
    then
        ./$ncnn_test
    fi
done
```

Output of `bash do_test.sh`:
----- test_absval
----- test_batchnorm
----- test_bias
----- test_binaryop
----- test_bnll
----- test_c_api
----- test_cast
----- test_clip
----- test_concat
----- test_convolution
value not match  at c:4 d:0 h:0 w:0    expect 0.754376 but got 1.000000
test_layer_cpu failed
test_layer Convolution failed use_packing_layout=0 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=1 use_winograd_convolution=1
test_convolution_int8 failed w=9 h=7 c=7 outch=7 kernel=1 dilation=1 stride=1 pad=0 bias=1 requant=0 act=4 actparams=[-0.136048,0.266064]
----- test_convolution1d
----- test_convolution3d
----- test_convolutiondepthwise
value not match  at c:0 d:0 h:0 w:0    expect 0.930540 but got 0.500000
test_layer_cpu failed
test_layer ConvolutionDepthWise failed use_packing_layout=0 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=1 use_winograd_convolution=1
test_convolutiondepthwise_int8 failed w=15 h=7 c=8 outch=8 kernel=3 dilation=1 stride=1 pad=1 bias=0 group=2 requant=0 act=4 actparams=[-0.370383,0.139109]
----- test_convolutiondepthwise1d
----- test_convolutiondepthwise3d
----- test_cpu
----- test_crop
----- test_deconvolution
----- test_deconvolution1d
----- test_deconvolution3d
----- test_deconvolutiondepthwise
----- test_deconvolutiondepthwise1d
----- test_deconvolutiondepthwise3d
----- test_deepcopy
----- test_deformableconv2d
----- test_dequantize
----- test_dropout
----- test_einsum
----- test_eltwise
----- test_elu
----- test_expanddims
----- test_flatten
----- test_gelu
----- test_gemm
----- test_groupnorm
----- test_gru
----- test_hardsigmoid
----- test_hardswish
----- test_innerproduct
----- test_instancenorm
----- test_interp
----- test_layernorm
----- test_lrn
----- test_lstm
----- test_matmul
----- test_mat_pixel
----- test_mat_pixel_affine
----- test_mat_pixel_drawing
----- test_mat_pixel_resize
----- test_mat_pixel_rotate
----- test_memorydata
----- test_mish
----- test_multiheadattention
----- test_noop
----- test_normalize
----- test_packing
----- test_padding
----- test_permute
----- test_pixelshuffle
----- test_pooling
----- test_pooling1d
----- test_pooling3d
----- test_power
----- test_prelu
----- test_priorbox
----- test_quantize
----- test_reduction
----- test_relu
----- test_reorg
----- test_requantize
value not match  at c:1 d:0 h:0 w:0    expect 26.000000 but got -2.000000
test_layer_cpu failed
test_layer Requantize failed use_packing_layout=1 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=0 use_winograd_convolution=0
test_requantize failed a.dims=3 a=(7 9 12) scale_in_data_size=1 scale_out_data_size=1 bias_data_size=12 act=0 actparams=[0.000000,0.000000]
----- test_reshape
----- test_rnn
----- test_roialign
----- test_roipooling
----- test_scale
----- test_selu
----- test_shufflechannel
----- test_sigmoid
----- test_slice
----- test_softmax
----- test_softplus
----- test_squeeze
----- test_squeezenet
----- test_swish
----- test_tanh
----- test_tile
----- test_unaryop
----- test_yolov3detectionoutput

Three test failures remain; please take a look.

For madd, msub and vbitsel I temporarily modified lsxintrin.h to swap the argument order so the results come out correct. I will instead fix the code under layer/loongarch later.

@junchao-loongson
Contributor Author

junchao-loongson commented Oct 27, 2022

[screenshots: ncnn-1 and ncnn-2 benchmark comparison images]

The unit tests now pass. The images above compare benchmark performance. Since the machines I have are limited, I could not find a 3A5000 and a 3A4000 running at the same clock frequency for the comparison.

Benchmark command:
../build/benchmark/benchncnn 10 $(nproc) 0 0

3A5000 configuration:

CPU: 2.5 GHz
Loongson-3A5000-7A1000-1w-V0.1-CRB
Memory: 8 GB × 2

3A4000 configuration:

CPU: 1.8 GHz
Memory: 8 GB × 2, DDR4, Speed: 2133 MT/s

If anyone has a 3A4000 with a comparable configuration, please run the benchmark for comparison; my 3A4000 machine is a bit underpowered.

@Yoh-Z
Contributor

Yoh-Z commented Oct 27, 2022

tql (impressive!)

```cpp
{
    __builtin_prefetch(ptr + 16);
    __m128i _p = __lsx_vld(ptr, 0);
    v4f32 _outp = (v4f32)__lsx_vbitclri_w(_p, 31);
```
Member

__lsx_vst accepts __m128i type, no v4f32 casting needed
this comment applies all through the patch


We can only use __m128i as much as possible; the functions beginning with __lsx_vf require v4f32 arguments, so __m128i still has to be cast to v4f32.

@codecov-commenter

codecov-commenter commented Oct 29, 2022

Codecov Report

Merging #4242 (a29f701) into master (e7eadca) will decrease coverage by 2.74%.
The diff coverage is 66.66%.

```
@@            Coverage Diff             @@
##           master    #4242      +/-   ##
==========================================
- Coverage   94.44%   91.70%   -2.75%
==========================================
  Files         750      783      +33
  Lines      179375   184371    +4996
==========================================
- Hits       169417   169071     -346
- Misses       9958    15300    +5342
```

| Impacted Files | Coverage Δ |
| --- | --- |
| src/layer.cpp | 46.00% <ø> (ø) |
| src/mat.h | 89.87% <ø> (+0.05%) ⬆️ |
| src/cpu.cpp | 58.43% <66.66%> (-3.69%) ⬇️ |
| src/layer/x86/deformableconv2d_x86.cpp | 0.00% <0.00%> (-97.92%) ⬇️ |
| src/layer/deformableconv2d.cpp | 0.00% <0.00%> (-97.20%) ⬇️ |
| src/layer/arm/innerproduct_arm_asimdhp.cpp | 94.72% <0.00%> (-3.19%) ⬇️ |
| src/layer/x86/innerproduct_x86.cpp | 97.12% <0.00%> (-2.23%) ⬇️ |
| src/allocator.cpp | 75.79% <0.00%> (-1.26%) ⬇️ |
| src/layer/x86/convolution_winograd_dot_pack16.h | 96.89% <0.00%> (-0.86%) ⬇️ |

... and 568 more


@nihui nihui merged commit 279222c into Tencent:master Nov 11, 2022
@nihui
Member

nihui commented Nov 11, 2022

Thanks for your contribution !

csukuangfj added a commit to csukuangfj/ncnn that referenced this pull request Dec 1, 2022
* remove duplicated newline (Tencent#4187)

* remove duplicated newline (Tencent#4188)

* optmize softmax arm neon (Tencent#4171)

* [docs] Fix typo (Tencent#4201)

* [Prelu x86] Finish intrinsic with elempack merged (Tencent#4177)

* changed size of images for pretty formatting of page (Tencent#4193)

* [Gelu x86] Finish intrinsic with elempack merged(fast version) (Tencent#4144)

* Finish the gelu x86 intrinsics
* Finish the fast tanh x86 simd impl

* Ignore .xmake directory (Tencent#4212)

* Bump pypa/cibuildwheel from 2.9.0 to 2.10.1 (Tencent#4207)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.9.0 to 2.10.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.9.0...v2.10.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* style: space alignment (Tencent#4217)

* Ignore CMakeSettings.json, the Visual Studio CMake schema file (Tencent#4228)

* RVV: use new interface for segment load/store & change word_type to size_t&add clang ci (part Tencent#4100) (Tencent#4118)

* RVV: use size_t for vl

* RVV: replace vsseg.v tuple type by using regex

-----

search:
vsseg([1-9])e(8|16|32)_v_(f|i|u)\2m(1|2|4|8)x\1\(([ -~]+), vcreate_\3\2m\4x\1\(([ -~]+)\), vl\);

substitute by:
vsseg$1e$2_v_$3$2m$4($5, $6, vl);

* RVV: replace vssseg.v tuple types by using regex

---

search:
vssseg([1-9])e(8|16|32)_v_f\2m1x\1\(([ -~]+), vcreate_f\2m1x\1\(([ -~]+)\), vl\);

substitute by:
vssseg$1e$2_v_f$2m1($3, $4, vl);

* RVV: replace vlseg.v tuple types in load/store

* RVV: replace vloxseg2ei32.v tuple types

* RVV: add a wrapper for old compilers

* RVV: add segment load/store wrapper in pakcing

* RVV: fix cmake test

* RVV: make clang happy by dropping VLAs in sgemm

* RVV: add clang cmake toolchain configure

* RVV: add clang ci, riscv64-unknown-linux-gnu

Co-authored-by: thelastlin <thelastlin@users.noreply.github.com>
Co-authored-by: nihui <shuizhuyuanluo@126.com>

* Bump pypa/cibuildwheel from 2.10.1 to 2.10.2 (Tencent#4220)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.10.1 to 2.10.2.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.10.1...v2.10.2)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* add c906 build ci (Tencent#4232)

* Add benchmark result of T-Head TH1520 (Tencent#4240)

`cpuinfo`: 

```
isa             : rv64imafdcvsu
mmu             : sv39
cpu-freq                : 1.848Ghz
cpu-icache              : 64KB
cpu-dcache              : 64KB
cpu-l2cache             : 1MB
cpu-tlb         : 1024 4-ways
cpu-cacheline           : 64Bytes
cpu-vector              : 0.7.1
```

Compiled with `-DCMAKE_TOOLCHAIN_FILE=../toolchains/c910-v240.toolchain.cmake -DCMAKE_BUILD_TYPE=release -DNCNN_OPENMP=OFF -DNCNN_THREADS=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_RVV=ON -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON` 

Seems much worse than expected 🤔

* fix param parsing issue when layer/blob name exceeds 255 (Tencent#4236)

* fix param parsing issue when layer/blob name exceeds 255

* apply code-format changes

Co-authored-by: ZhangGe6 <ZhangGe6@users.noreply.github.com>

* Memory Pool Improvement For Variadic Sized Inputs (Tencent#4190)

* Simple miss count for better space efficiency

* Simple double ended greedy;

* Add size drop threshold setter;

* set workspace allocator cr to zero as we had some sort of recylcing capability :P

Co-authored-by: LinHeLurking <LinHeLurking@users.noreply.github.com>
Co-authored-by: nihuini <nihuini@tencent.com>

* docs: disable fp16 when wrong results encountered caused by overflow (Tencent#4248)

* pnnx math operation (Tencent#4251)

* more stricter armv7 fp16 and armv84 bf16 compiler check, fix Tencent#4147 fix Tencent#4222 (Tencent#4247)

* modified the param axes of expanddims in modelwriter (Tencent#4259)

* Add TH1520 (4*C910V) toolchain support.  (Tencent#4267)

* implement lstm proj_size (Tencent#4263)

* Optimize x86 DeformableConv2D (Tencent#4128)

* fix compile warning with gcc 9.1.0 including simplestl.h file (Tencent#4274)

* fix compile warning with gcc 9.1.0 including simplestl.h file

* apply code-format changes

Co-authored-by: veahow <veahow@users.noreply.github.com>

* add benchmark for rk3588 on rock5b (Tencent#4275)

* linux-x64-cpu-gcc on tencent ci

* implement layer feature disabled bit (Tencent#4278)

* add elu vulkan operator (Tencent#4280)

* fix tencent ci (Tencent#4277)

* implement GLU and pnnx conversion (Tencent#4283)

* Bump pypa/cibuildwheel from 2.10.2 to 2.11.1 (Tencent#4271)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.10.2 to 2.11.1.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.10.2...v2.11.1)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix pnnx softmax/normalize/slice negative axis conversion to ncnn (Tencent#4284)

* pnnx glu batchindex aware conversion (Tencent#4285)

* 1. Fix typo in readme (Tencent#4287)

* x86 sse2/avx2 optimization for convolution sgemm/winograd int8 family (Tencent#4286)

* pnnx skip dynamic size evaluation (Tencent#4291)

* Fix linux build error(Tencent#4265) (Tencent#4294)

Co-authored-by: wangyu <786794414@qq.com>

* general cpu feature detection on macos/ios, enable bf16 and i8mm on a15 a16 and m2 (Tencent#4300)

* x86 unified fc fp32/fp16s (Tencent#4303)

* more fma
* more transpose utility function

* Bump pypa/cibuildwheel from 2.11.1 to 2.11.2 (Tencent#4308)

Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.11.1 to 2.11.2.
- [Release notes](https://github.com/pypa/cibuildwheel/releases)
- [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md)
- [Commits](pypa/cibuildwheel@v2.11.1...v2.11.2)

---
updated-dependencies:
- dependency-name: pypa/cibuildwheel
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* pnnx pytorch 1.13 (Tencent#4314)

* fix Tencent#4315 (Tencent#4316)

* get_physical_cpu_count api family (Tencent#4302)

* get_physical_cpu_count api family

* set default to physical big cpu

* always treat smt core as big core

* is_smt_cpu

* get max freq mhz on windows

* windows thread affinity

* groupnorm 1d/2d/4d (Tencent#4312)

* fix slice end index, fix fp16 model weight alignment (Tencent#4317)

* tencent ci test-coverage pnnx (Tencent#4305)

* RVV: BatchNorm with fp16s(a) support (Tencent#4075)

* RVV: InstanceNorm with fp16s(a) support (Tencent#4078)

* fix ci pnnx build

* fold new_full and full_like (Tencent#4323)

* pnnx convert nn.Softmax2d (Tencent#4324)

* pnnx convert fold unfold (Tencent#4325)

* support yolov5 6.2 (Tencent#4328)

* implement ncnn fold and unfold (Tencent#4326)

* pnnx load gpu torchscript and reset device (Tencent#4330)

* fix:pnnx-softmax (Tencent#4333)

* pnnx save onnx zero (Tencent#4077)

* save foldable constants in file for reducing memory usage (Tencent#4337)

* match inplace slice copy pattern, rewrite copy uses (Tencent#4338)

* add vector optimization for loongarch64 (Tencent#4242)

* ci loongarch64 lsx (Tencent#4344)

* gridsample op support (Tencent#4288)



Co-authored-by: LRY89757 <LRY89757@users.noreply.github.com>
Co-authored-by: nihuini <nihuini@tencent.com>
Co-authored-by: nihui <shuizhuyuanluo@126.com>

* squeeze and expanddims 4d (Tencent#4346)

* implement MultiheadAttention kdim vdim (Tencent#4347)

* pnnx convert torch bitwise left_shift right_shift (Tencent#4349)

* pnnx fp16 option for ncnn and onnx weight type (Tencent#4350)

* pnnx fuse more function to module (Tencent#4351)

* pnnx fuse more function to module

* rename some pass name

* fuse adjacent reshape, fuse pad conv2d

* fuse pad conv1d

* split tests (Tencent#4354)

* Support mat.numpy() in Python (Tencent#4356)

* Fix typo in stb_image.h (Tencent#4358)

exitting -> exiting

* Fix windows-arm64 build for non-neon case (Tencent#4227)

* update release ci (Tencent#4359)

* update release ci

* find modern glslang

* parallel jobs on windows

* Fix c api allocator (Tencent#4360)

* add some c_api interfaces related to allocator setup.

* fix errors in allocator parameters in c_api.

* test c api allocator

Co-authored-by: zhangtongshe <yuyuyezi@vip.qq.com>

* update glslang (Tencent#4361)

* disable out-of-line atomics since ndk23+ for resolving linking issue with old ndk (Tencent#4362)

* I added one more project to the list of examples. (Tencent#4205)

* Dedicated to coloring black and white photographs.

* add example project link (Tencent#4365)

* fix(pybind11): build error (Tencent#4368)

* fix openmp affinity abort when cpu goes offline (Tencent#4370)

* Update release-python.yml

* small fixes

* unpack list input

* Remove LSTM2

* fix LSTM

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Menci <huanghaorui301@gmail.com>
Co-authored-by: luqiang guo <702572275@qq.com>
Co-authored-by: Lry89757 <77330637+LRY89757@users.noreply.github.com>
Co-authored-by: magicse <magicse@users.noreply.github.com>
Co-authored-by: Zhuo Zhang <imzhuo@foxmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: 汤圆奶昔 <47135403+tonori@users.noreply.github.com>
Co-authored-by: Xavier Hsinyuan <me@lstlx.com>
Co-authored-by: thelastlin <thelastlin@users.noreply.github.com>
Co-authored-by: nihui <shuizhuyuanluo@126.com>
Co-authored-by: 柚木鉉 <740291272@qq.com>
Co-authored-by: Zhang Ge <sjtu.zg123@gmail.com>
Co-authored-by: ZhangGe6 <ZhangGe6@users.noreply.github.com>
Co-authored-by: LinHe <LinHe.Lurking@gmail.com>
Co-authored-by: LinHeLurking <LinHeLurking@users.noreply.github.com>
Co-authored-by: nihuini <nihuini@tencent.com>
Co-authored-by: MisakaBit <MisakaBit@gmail.com>
Co-authored-by: LiuYi-Up <73060646+LiuYi-Up@users.noreply.github.com>
Co-authored-by: 陸 言 <robinluaa@outlook.com>
Co-authored-by: miemie2013 <53960695+miemie2013@users.noreply.github.com>
Co-authored-by: Eahow Chen <15228088+veahow@users.noreply.github.com>
Co-authored-by: veahow <veahow@users.noreply.github.com>
Co-authored-by: li mengyang <hwdefcom@outlook.com>
Co-authored-by: Yoh <wpz_yoh@163.com>
Co-authored-by: Caize Wu <zepanwucai@gmail.com>
Co-authored-by: bestpower <wangyu117136@gmail.com>
Co-authored-by: wangyu <786794414@qq.com>
Co-authored-by: shaoshengsong <30892500+shaoshengsong@users.noreply.github.com>
Co-authored-by: WuJinxuan <2456510228@qq.com>
Co-authored-by: junchao-loongson <68935141+junchao-loongson@users.noreply.github.com>
Co-authored-by: LRY89757 <LRY89757@users.noreply.github.com>
Co-authored-by: Ikko Ashimine <eltociear@gmail.com>
Co-authored-by: zhangtongshe <yuyuyezi@vip.qq.com>
Co-authored-by: tpoisonooo <khj.application@aliyun.com>