Memory Pool Improvement For Variadic Sized Inputs #4190

LinHeLurking · 2022-09-07T15:40:50Z

A Better Memory Pool With Simple Greedy Strategy

This pull request contains several modifications to the current ncnn pool allocators.

Basic Assumption

The program runs in multiple stages. Within each stage, the memory sizes the program requests lie in a (relatively small) interval. When the program enters a new stage, the size interval changes. The intervals of different stages may or may not overlap.

Double Ended Greedy Strategy

If no cached chunk in the pool can satisfy the current allocation, the allocator tries to remove an outdated chunk.

If the requested size is larger than every cached chunk, remove the smallest chunk (see below).

                NEW
           ^============^
        ___^_____       ^
       |   ^     |      ^
----------------------------------------
         OLD        ^    
                    |
                    |
                ALLOCATION

If the requested size is smaller than every cached chunk, remove the largest chunk.
If the requested size is between the smallest and the largest, do not remove anything.
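
The three rules above can be sketched as a small decision function. This is only an illustration under simplifying assumptions, not the code in src/allocator.cpp: `ChunkList` and `pick_victim` are hypothetical names, the cache holds bare sizes instead of (size, pointer) pairs, and the caller is assumed to have already established that no cached chunk fits the request.

```cpp
#include <cstddef>
#include <list>

// Hypothetical stand-in for the pool's cache: sizes of free chunks only.
typedef std::list<std::size_t> ChunkList;

// Returns the chunk to evict for a request that no cached chunk could serve,
// or chunks.end() if nothing should be evicted.
ChunkList::iterator pick_victim(ChunkList& chunks, std::size_t request)
{
    if (chunks.empty())
        return chunks.end();

    // Single pass to locate the smallest and the largest cached chunk.
    ChunkList::iterator it_min = chunks.begin();
    ChunkList::iterator it_max = chunks.begin();
    for (ChunkList::iterator it = chunks.begin(); it != chunks.end(); ++it)
    {
        if (*it < *it_min) it_min = it;
        if (*it > *it_max) it_max = it;
    }

    if (request > *it_max) return it_min; // sizes moved up: the smallest chunk is outdated
    if (request < *it_min) return it_max; // sizes moved down: the largest chunk is outdated
    return chunks.end();                  // request lies inside the cached range: keep all
}
```

With cached sizes {40, 10, 30}, a request of 50 selects the chunk of size 10, a request of 5 selects the chunk of size 40, and a request of 20 selects nothing.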

Simulated Results

The following code can be used to simulate a multi-stage run.

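#include <algorithm>
#include <chrono>
#include <cstddef>
#include <deque>
#include <future>
#include <random>
#include <thread>
#include <vector>

#include "mat.h" // ncnn::Mat; also pulls in allocator.h

int thrd_share_allocator_test(int repeat, ncnn::Allocator* allocator)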
{
    // Fix seed for better reproducibility.
    std::mt19937 rng(0xbadf00d);
    std::uniform_int_distribution<size_t> dist;

    std::deque<ncnn::Mat> occupied;
    int batch_size = 30;
    int batch_num = repeat;
    for (int batch_id = 0; batch_id < batch_num; ++batch_id)
    {
        // Centroid and radius absolute size does not matter.
        // You can scale them arbitrarily.
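        // Apply the modulo before clamping so the centroid is never 0, and use
        // explicit size_t template arguments so this also compiles where size_t is not unsigned long.
        size_t centroid = std::max<size_t>(1, dist(rng) % 100000000);              // max centroid is 100 MB
        size_t radius = std::min(centroid, std::min<size_t>(dist(rng), 30000000)); // max radius is 30 MB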

        std::uniform_int_distribution<size_t> size_dist(centroid - radius, centroid + radius);
        std::vector<void*> allocated;
        for (int i = 0; i < batch_size; ++i)
        {
            size_t size = size_dist(rng) & ~0x13FFF; // the mask clears bits 0-13 and bit 16, so every size is a multiple of 16 KB
            void* ptr = allocator->fastMalloc(size);
            allocated.push_back(ptr);
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        for (void* ptr : allocated)
        {
            allocator->fastFree(ptr);
        }
    }
    return 0;
}

int test_allocator(int thrd_num = 4, int repeat = int(1e1))
{
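    // Stack object instead of a leaked `new`; it outlives all workers because
    // every future is joined below before this function returns.
    ncnn::PoolAllocator allocator;
    std::vector<std::future<int> > futures;
    for (int i = 0; i < thrd_num; ++i)
    {
        futures.emplace_back(std::async(thrd_share_allocator_test, repeat, &allocator));
    }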
    int flag = 0;
    for (auto& future : futures)
    {
        flag |= future.get();
    }
    return flag;
}

int main()
{
    return 0 || test_allocator();
}

Memory usage:
(Memory usage is gathered with valgrind: valgrind --tool=massif --time-unit=ms <executable>)


LinHeLurking · 2022-09-07T15:47:14Z

I'm not sure why the old commits show up here. Only cee9d22 actually contains the change for this pull request.
😆

codecov-commenter · 2022-09-07T15:50:45Z

Codecov Report

Merging #4190 (4066294) into master (479a73a) will decrease coverage by 0.18%.
The diff coverage is 48.57%.

@@            Coverage Diff             @@
##           master    #4190      +/-   ##
==========================================
- Coverage   94.43%   94.25%   -0.19%     
==========================================
  Files         749      750       +1     
  Lines      179049   179405     +356     
==========================================
+ Hits       169091   169094       +3     
- Misses       9958    10311     +353

Impacted Files	Coverage Δ
src/allocator.h	`87.50% <ø> (ø)`
src/allocator.cpp	`75.11% <47.05%> (-1.94%)`	⬇️
src/net.cpp	`65.37% <100.00%> (ø)`
src/command.cpp	`72.70% <0.00%> (-14.94%)`	⬇️
src/pipeline.cpp	`58.69% <0.00%> (-2.18%)`	⬇️
src/layer/vulkan/reshape_vulkan.cpp	`92.01% <0.00%> (-2.14%)`	⬇️
src/layer/vulkan/packing_vulkan.cpp	`81.70% <0.00%> (-1.88%)`	⬇️
src/layer/vulkan/permute_vulkan.cpp	`96.99% <0.00%> (-1.60%)`	⬇️
src/layer/vulkan/reorg_vulkan.cpp	`96.35% <0.00%> (-1.57%)`	⬇️
src/layer/vulkan/pixelshuffle_vulkan.cpp	`96.35% <0.00%> (-1.57%)`	⬇️
... and 24 more


nihui · 2022-09-13T04:38:56Z

please investigate sanitizer error

LinHeLurking · 2022-09-23T11:39:24Z

I've tried but cannot reproduce the error of test_squeezenet. Is there any way to run CTest with extra checks on?

nihui · 2022-09-23T22:23:45Z

cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_ASAN=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..

LRY89757 · 2022-09-24T01:48:07Z

> I've tried but cannot reproduce the error of test_squeezenet. Is there any way to run CTest with extra checks on?

Maybe you can refer to this line in the file linux-x64-cpu-gcc-san.yml :)

LinHeLurking · 2022-09-26T08:38:59Z

(First of all, sorry for the late response.)

It turns out that the CMake ASAN switch is ignored if it is set in the VS Code settings.json file. (But why?)
Anyway, I've fixed the sanitizer error. 😆

nihui · 2022-10-01T14:01:49Z

src/allocator.cpp

@@ -33,6 +33,7 @@ class PoolAllocatorPrivate
    Mutex budgets_lock;
    Mutex payouts_lock;
    unsigned int size_compare_ratio; // 0~256
+    static const size_t size_threshold = 10;


size_threshold does not seem expressive enough.
Would budget_count_threshold or budget_drop_threshold be better?

How about adding a setter API that lets the user choose between a more aggressive and a more conservative budget recycling strategy?

Renamed, and added a setter.
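
A minimal sketch of what such a setter could look like, assuming the threshold gates eviction the same way the `d->budgets.size() >= d->size_threshold` check does. `ToyPool`, `set_budget_drop_threshold`, `cache` and `may_drop` are hypothetical names chosen for illustration; they are not the API that was merged.

```cpp
#include <cstddef>
#include <list>

// Toy pool that only tracks cached chunk sizes; every name here is hypothetical.
class ToyPool
{
public:
    ToyPool() : budget_drop_threshold_(10) {} // default of 10 taken from the diff above

    // A smaller threshold lets eviction start with fewer cached chunks
    // (more aggressive); a larger one keeps more chunks around (more conservative).
    void set_budget_drop_threshold(std::size_t n) { budget_drop_threshold_ = n; }

    // Record a freed chunk as cached.
    void cache(std::size_t chunk_size) { budgets_.push_back(chunk_size); }

    // Eviction is only considered once the cache holds at least `threshold` chunks.
    bool may_drop() const { return budgets_.size() >= budget_drop_threshold_; }

private:
    std::list<std::size_t> budgets_;
    std::size_t budget_drop_threshold_;
};
```

Turning the compile-time constant into a member with a setter keeps the default behavior unchanged while letting memory-constrained users lower the threshold.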

nihui · 2022-10-01T14:05:23Z

src/allocator.cpp

@@ -104,11 +105,20 @@ void* PoolAllocator::fastMalloc(size_t size)
    d->budgets_lock.lock();

    // find free budget
-    std::list<std::pair<size_t, void*> >::iterator it = d->budgets.begin();
+    std::list<std::pair<size_t, void*> >::iterator it = d->budgets.begin(), it_max = d->budgets.end(), it_min = d->budgets.end();


Why are these max/min iterators initialized to end() instead of begin()?
If budgets is empty, the check if (d->budgets.size() >= d->size_threshold) is always false.

Yes, it doesn't matter.
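
To illustrate why either initial value works, here is a hedged sketch of a scan that uses end() as a "not found yet" sentinel. `BudgetList`, `MinMax` and `scan_min_max` are hypothetical names; only the element type mirrors the `std::list<std::pair<size_t, void*> >` in the diff. For an empty list begin() equals end(), so both choices leave the iterators at end() and the size check must guard any dereference.

```cpp
#include <cstddef>
#include <list>
#include <utility>

typedef std::list<std::pair<std::size_t, void*> > BudgetList;

struct MinMax
{
    BudgetList::iterator it_min;
    BudgetList::iterator it_max;
};

// One pass over the list. Both iterators stay at end() when the list is empty,
// so callers must check against end() (or the list size) before dereferencing.
MinMax scan_min_max(BudgetList& budgets)
{
    MinMax r = {budgets.end(), budgets.end()};
    for (BudgetList::iterator it = budgets.begin(); it != budgets.end(); ++it)
    {
        if (r.it_min == budgets.end() || it->first < r.it_min->first) r.it_min = it;
        if (r.it_max == budgets.end() || it->first > r.it_max->first) r.it_max = it;
    }
    return r;
}
```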


nihui · 2022-10-09T02:42:56Z

Thanks for your contribution !


LinHeLurking and others added 30 commits July 18, 2022 20:59

A LayerNorm_x86 class mocking LayerNorm for tests;

f5783b5

All SIMD optimizations success wihout support_packing; Maybe there's something strange in packing layout;

7a94b1a

Located error about packed layout.

b126e0f

All test passed; Now it supports packing layout

1982605

Fix runtime cpu dispatch;

0fa8689

Use fmadd wrapper in x86_usability.h;

6a683d3

Merge packed & unpacked code.

bf95312

Func rename.

af97b05

Simplify and merge more branches about packed layout;

a9be63a

Code format

976692a

Replace some member functions with static inline functions.

d7007c3

Add copyright header

508d143

apply code-format changes

cf015d8

Add more tests with 16 packed for AVX512

5084955

Code format

48fb4ea

Merge branch 'master' of https://github.com/Tencent/ncnn

8c4ed97

Merge branch 'master' of https://github.com/LinHeLurking/ncnn

3c2c1c8

Copyright statement year fixed

487568d

Fix accidentally added corelation of mean/var and SIMD ISA

23db5ab

Fix accidentally added corelation of fmadd/affine_fmadd and SIMD ISA

72777b4

Fix a wrong test param

b20d298

Fix runtime dispatch

4fddf9e

apply code-format changes

2555b3e

no store duplicates

1b118f7

Merge branch 'Tencent:master' into master

649a63b

Simple miss count for better space efficiency

49a5b7f

Change miss count position;

2ecda04

Simple double ended greedy;

cee9d22

Merge branch 'Tencent:master' into master

3f2731f

Merge branch 'master' into mem-pool

577d593

Fix santinizer bug;

ee7ca30

nihui reviewed Oct 1, 2022

LinHeLurking and others added 3 commits October 6, 2022 08:41

Change it_max/it_min initial value;

d4b6698

Add size drop threshold setter;

4a0a1a9

set workspace allocator cr to zero as we had some sort of recylcing capability :P

4066294

nihui merged commit 9426e21 into Tencent:master Oct 9, 2022
