RFC: various optimization targeting 'rename' and "compact' (but not only...) #621

invoxiaamo · 2021-12-01T14:50:35Z

Hello.
While we are working with littlefs for few months, we experience few long stall of the system during up to 30s, despite having a solid hardware (250+ MHz ARM M33, lot of RAM, 50 MB/s of NOR flash bus speed).

Our investigation quickly focused on the rename operations that are used for atomic file substitution.
The issues are similar to #327.

In order to easily reproduce the issue we found the following scenario which is very representative of the problem:

    - format and mount a 4KB block FS
    - create 100 files of 256 Bytes named "/dummy_%d"
    - create a 1024 Byte file "/test"
    - rename "/test" "/test_rename"
    - create a 1024 Byte file "/test"
    - rename "/test" "/test_rename"  [1]

The second rename [1] trigger a compaction of the dir.
We measure various performance counters:

lfs_dir_traverse is called 148 393 times
lfs_bd_read is called 25 205 444 times (for 100MB+ of data requested)
With current cache implementation, 13 872 154 SPI read transfers where done (222MB+ of data transfers).
THIS IS HUGE for a 102 files / 28KB filled FS....

We provide here 2 cache improvements and a simple but very effective lfs_dir_traverse optimization.
All the optimizations are independent.

The cache "read ahead" optimization simply request larger rcache fill when it detects that reads are continuous.
=> 66% reduction if IO operations
=> 20% increase of data transfers on SPI
=> 33% overall time reduction of SPI transfers (since the effective data transfers is only a small part of the SPI command sequence)
A "dynamic" block caching (enabled with LFS_DYN_CACHE define) to malloc a whole block when we detect random consecutive read on a single block. the RAM allocated is freed as soon as the user lfs_xxxxx operation returns. Consequently it should not increase heap usage as long as there is a single block temporary available in RAM.
=> 99.997 % reduction of IO operations (compared to (1)) - only 123 remaining block read vs. 13872154
=> 99.993 % reduction of IO data transfer - only 15KB vs. 222MB.
When RAM (eg. 4KB for our plaform) is not an issue, this is a very interesting optimization.
Yet, despite (2), the overall CPU time only decrease by 75%, leaving the huge number of times lfs_dir_traverse is called.
We analyze the call graph and saw that 4 levels of recursive calls of lfs_dir_traverse are done during compaction.
A O(n^4) is obviously a problem...
The 4 level is due to the presence of LFS_FROM_MOVE tag is attribute walk list.

 [1] lfs_dir_traverse(cb=lfs_dir_commit_size)
     - do 'duplicates and tag update'
   [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
       - Reaching a LFS_FROM_MOVE tag
     [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
         - do 'duplicates and tag update'
       [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])

Yet, for the particular case of cb=lfs_dir_traverse_filter for [3] and [4], we saw that all the calls are useless

can't raise 'duplicate' trigger of [2]
don't update tag[1] of level [2]

The shortcut of this particular case reduce the number of lfs_dir_traverse calls by 97.8% (3248 vs 148393).

With all the 3 optimizations, the overall compaction time on our platform drop by 98.3%

Monitoring the lfs_bd_read usage during time consuming operations (eg. rename) shows a very large of uncache reads on small (less than cache size) consecutive offsets. Implementing a read ahead strategy when such consecutive reads are detective is an effective optimization.

…domly

Enabled with LFS_PERF_STATS define

Rename can be VERY time consuming. One of the reasons is the 4 recursion level depth of lfs_dir_traverse() seen if a compaction happened during the rename. lfs_dir_compact() size computation [1] lfs_dir_traverse(cb=lfs_dir_commit_size) - do 'duplicates and tag update' [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) - Reaching a LFS_FROM_MOVE tag (here) [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir - do 'duplicates and tag update' [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3]) followed by the compaction itself: [1] lfs_dir_traverse(cb=lfs_dir_commit_commit) - do 'duplicates and tag update' [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) - Reaching a LFS_FROM_MOVE tag (here) [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir - do 'duplicates and tag update' [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3]) Yet, analyse shows that levels [3] and [4] don't perform anything if the callback is lfs_dir_traverse_filter... A practical example: - format and mount a 4KB block FS - create 100 files of 256 Bytes named "/dummy_%d" - create a 1024 Byte file "/test" - rename "/test" "/test_rename" - create a 1024 Byte file "/test" - rename "/test" "/test_rename" This triggers a compaction where lfs_dir_traverse was called 148393 times, generating 25e6+ lfs_bd_read calls (~100 MB+ of data) With the optimization, lfs_dir_traverse is now called 3248 times (589e3 lfs_bds_calls (~2.3MB of data) => x 43 improvement...

genotix · 2021-12-01T19:08:57Z

Dude this is seriously -amazing- work!!
Huge thumbs up!!!

dpgeorge · 2021-12-10T06:50:33Z

This sounds really great!

Do you think it's possible to implement optimisation (2) (and possibly (1) as well) at the block device layer? That is, not within littlefs itself but rather within the read/prog calls defined by the user?

invoxiaamo · 2021-12-10T08:44:12Z

@dpgeorge

(2) is really an small improvement of the current cache. Either is is integrated as a fix of the current one, either it can be skipped.
(1) can obviously be implemented in the flash driver, but the interesting part of the current implementation is that malloc memory is released as soon as the lfs_open/write/rename... call return. It means that the memory is not steal and doesn't increase the memory pressure of the embedded device.

invoxiaamo · 2021-12-10T09:09:08Z

@AlexanderCsr8904
Is your suggestion is to squash all commits of this pull request into one ?
In fact, this pull request is a RFC to collect reviews and ideas. I will create independent pull requests later, in particularly (3) which is the most important.
For (2), not everybody has the same needs. For some, memory (even temporary malloc) is an issue compared to performance (think FS only used to store few configuration files), while for other, speed is important. That's why (2) depends of the LFS_DYN_CACHE configuration define.
There is also the commit concerning performance statistics which is independent.

BenBE

Just two minor typos …

lfs.c

Co-authored-by: BenBE <BenBE@geshi.org>

GIldons

Fix dynamic cache initialization.

lfs.c

Co-authored-by: GIldons <raulgildons@gmail.com>

geky · 2022-04-09T05:40:46Z

Hi @invoxiaamo, this is a really great analysis and these are very creative solution! Thanks for creating this PR and sorry I wasn't able to properly look into it until now.

(3) seems immediately actionable. The original algorithm shouldn't be exceeding O(n^2), but I can reproduce what you're seeing, so it appears this was a mistake when translating the design to implementation.

Would you be able to create a separate PR for just (3)? We would be able to merge it into the next release then.

Sorry if you've already mentioned it and I'm just missing it, but what cache_size are you using?

I'm curious how these optimizations compare to setting cache_size = block_size. In theory at that point the full metadata block should fit in the cache avoiding the need to go to disk O(n^2) times.

(2) is unfortunately a bit incompatible with some of littlefs's design, as littlefs really only uses malloc for convenience functions wrapping optionally static buffers. This means you can use littlefs without malloc without sacrificing any of its features.

In theory you could actually allocate the dyn-block-cache's buffer on the stack, but dynamic stack allocations come with their own set of problems which we should avoid...

It may be reasonable to just add an optional third cache to the lfs_config struct. In practice temporary allocations like this are difficult to take advantage of in resource constrained systems because of the need to be prepared for the worst-case RAM consumption.

A more general solution, one that I would like to see happen at some point, would be to change the caching system in littlefs to be able to cache blocks > 2 if enough RAM is provided. This would in effect turn littlefs's cache layer into an actual block cache and I believe provide a similar solution as (2), while also allowing more RAM to cache more blocks (though probably with diminishing returns).

(1) is quite a smart idea, though eagerly fetching reads was the intention of the hint argument to lfs_bd_read. Do you think the performance gains of (1) could be accomplished by tuning lfs_bd_read's hint in the hot path?

You can see hint is set to block_size (the maximum) in most of lfs_dir_fetchmatch:

littlefs/lfs.c

Lines 904 to 906 in 9c7e232

    
           int err = lfs_bd_read(lfs, 
        
                   NULL, &lfs->rcache, lfs->cfg->block_size, 
        
                   dir->pair[0], off, &tag, sizeof(tag));

But it looks like this could be improved in other parts of the code path:

littlefs/lfs.c

Lines 767 to 769 in 9c7e232

    
           int err = lfs_bd_read(lfs, 
        
                   NULL, &lfs->rcache, sizeof(tag), 
        
                   dir->pair[0], off, &tag, sizeof(tag));

invoxiaamo · 2022-04-09T09:14:45Z

hello @geky
I just create PR #666 with only (3).
Regards,
Arnaud

invoxiaamo added 4 commits December 1, 2021 14:44

cache: implement a dynamic cache of the whole block when accessed ran…

caaf63b

…domly

utils: Measure performance of various IO and function calls

593b616

Enabled with LFS_PERF_STATS define

invoxiaamo mentioned this pull request Dec 1, 2021

lfs_rename sometimes very slow #327

Open

BenBE reviewed Dec 10, 2021

View reviewed changes

lfs.c Outdated Show resolved Hide resolved

lfs.c Outdated Show resolved Hide resolved

invoxiaamo and others added 2 commits December 10, 2021 17:06

Update lfs.c

ae1d454

Co-authored-by: BenBE <BenBE@geshi.org>

Update lfs.c

6f4fc80

Co-authored-by: BenBE <BenBE@geshi.org>

geky added performance enhancement labels Mar 21, 2022

dpgeorge mentioned this pull request Mar 21, 2022

Restructure littlefs to not use recursion, measure stack usage #658

Merged

GIldons suggested changes Mar 24, 2022

View reviewed changes

lfs.c Show resolved Hide resolved

Update lfs.c

bf33e69

Co-authored-by: GIldons <raulgildons@gmail.com>

geky removed the enhancement label Apr 9, 2022

geky added this to the v2.5 milestone Apr 9, 2022

geky added the enhancement label Apr 9, 2022

geky removed this from the v2.5 milestone Apr 9, 2022

geky added the needs minor version new functionality only allowed in minor versions label Apr 9, 2022

geky mentioned this pull request Apr 11, 2022

Minor release: v2.5 #669

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: various optimization targeting 'rename' and "compact' (but not only...) #621

RFC: various optimization targeting 'rename' and "compact' (but not only...) #621

invoxiaamo commented Dec 1, 2021 •

edited

genotix commented Dec 1, 2021

dpgeorge commented Dec 10, 2021

invoxiaamo commented Dec 10, 2021

invoxiaamo commented Dec 10, 2021

BenBE left a comment

GIldons left a comment

geky commented Apr 9, 2022

invoxiaamo commented Apr 9, 2022 •

edited

RFC: various optimization targeting 'rename' and "compact' (but not only...) #621

Are you sure you want to change the base?

RFC: various optimization targeting 'rename' and "compact' (but not only...) #621

Conversation

invoxiaamo commented Dec 1, 2021 • edited

genotix commented Dec 1, 2021

dpgeorge commented Dec 10, 2021

invoxiaamo commented Dec 10, 2021

invoxiaamo commented Dec 10, 2021

BenBE left a comment

Choose a reason for hiding this comment

GIldons left a comment

Choose a reason for hiding this comment

geky commented Apr 9, 2022

invoxiaamo commented Apr 9, 2022 • edited

invoxiaamo commented Dec 1, 2021 •

edited

invoxiaamo commented Apr 9, 2022 •

edited