Question about scalability and large files #75

Open
FreddieChopin opened this issue Jul 14, 2018 · 15 comments

@FreddieChopin
Contributor

FreddieChopin commented Jul 14, 2018

First of all, this is not an "issue", just a question, as I'm not sure whether what I see is normal/expected or not.

I've ported littlefs to my C++ RTOS - https://github.com/DISTORTEC/distortos - and I'm trying it out with an SDHC 32 GB card on an STM32F429. My test goes more or less like this:

  • dump flash memory directly to SD card (raw write speed) - 10 times,
  • format the card with littlefs,
  • write the same dump of flash memory to a littlefs file (file system write speed), which always gets truncated when opened - 10 times.

These 3 steps are repeated for write sizes from the following sequence: 512 B, 1 kB, 2 kB, ... 512 kB, 1 MB, 2 MB.

After some runs I noticed that the difference between the speed of a raw sequential write to the SD card (with no file system) and the speed of a write through the file system is quite large, but only for "large" files. The difference grows as the file size grows - for example, when writing just 1 kB the ratio is only ~2.7, for 128 kB it is already ~7.2, while for 2 MB it reaches ~14.6 (all ratios computed from average values).

Below is a chart which shows my measurements:

[Chart: x axis - write size, y axis - ratio of average file system write speed to average raw write speed]

Here's another interesting observation - in my test I do 10 runs of raw writes, format the card with littlefs and do 10 runs of file system writes. While the speeds of the raw writes are mostly identical, the speeds of the file system writes vary:

  • for small files - at or below 16 kB - the first write is actually the slowest one, while the next 9 are a bit faster (for example 384 ms for the first write vs 320 ms for the second write),
  • for medium-sized files - 32 kB to 128 kB - the speed of file system writes is mostly consistent,
  • for large files - 256 kB to 2 MB - the first write is the fastest and the following writes are slower; for the 2 MB file the last 3 writes get even slower than the earlier 6.

Each write goes to the same file, which I truncate to zero on creation.

Here's the chart of these findings:

[Chart: x axis - write number, y axis - ratio of this write's speed to the speed of the first write]

I guess it's also worth noting that I use 512 bytes for the read, program and erase block sizes, and 512 for the "lookahead" value.

To summarise:

  1. Is it expected that for large files the performance is more than 10x lower than the raw performance of the SD card, while for small files the difference is much smaller?
  2. Is it expected that different writes (to the same, truncated file) have different speeds?

Maybe this is all related to the value of lookahead? I have not tested that yet, as the test takes quite some time (writing the 2 MB file takes 60-100 seconds). A lookahead of 512 blocks (each 512 B long) would correspond roughly to a file size of around 256 kB, which is where the mentioned effects start to appear...

Thanks in advance for your insight!

--

Below I paste my test code for reference, in case it is useful.

int main()
{
  distortos::devices::SpiSdMmcCard card {distortos::board::spi3Master, slaveSelect, 5000000};
  distortos::LittlefsFileSystem fileSystem;

  for (size_t writeSize {512}; writeSize <= 2 * 1024 * 1024; writeSize *= 2)
  {
    {
      const auto ret = card.open();
      if (ret != 0)
      {
        fiprintf(debugStream, "card.open(): %d\r\n", ret);
        distortos::ThisThread::sleepFor(std::chrono::seconds{60});
      }
    }
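    // raw write speed: 10 runs of dumping internal flash (0x8000000) directly to the card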
    for (size_t i {}; i < 10; ++i)
    {
      fiprintf(debugStream, "====================\r\n%zu b sequential write, run %zu\r\n", writeSize, i + 1);
      const auto start = distortos::TickClock::now();
      {
        const auto ret = card.erase(0, writeSize);
        if (ret != 0)
        {
          fiprintf(debugStream, "card.erase(): %d\r\n", ret);
          distortos::ThisThread::sleepFor(std::chrono::seconds{60});
        }
      }
      {
        const auto ret = card.program(0, reinterpret_cast<const void*>(0x8000000), writeSize);
        if (ret.first != 0 || ret.second != writeSize)
        {
          fiprintf(debugStream, "card.program(): %d, %zu\r\n", ret.first, ret.second);
          distortos::ThisThread::sleepFor(std::chrono::seconds{60});
        }
      }
      const auto end = distortos::TickClock::now();
      const auto elapsed = end - start;
      fiprintf(debugStream, "elapsed: %lld ms\r\n", std::chrono::milliseconds{elapsed}.count());
    }
    {
      const auto ret = card.close();
      if (ret != 0)
      {
        fiprintf(debugStream, "card.close(): %d\r\n", ret);
        distortos::ThisThread::sleepFor(std::chrono::seconds{60});
      }
    }
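    // format the card with littlefs and mount the file system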
    {
      const auto ret = distortos::LittlefsFileSystem::format(card);
      if (ret != 0)
      {
        fiprintf(debugStream, "distortos::LittlefsFileSystem::format(): %d\r\n", ret);
        distortos::ThisThread::sleepFor(std::chrono::seconds{60});
      }
    }
    {
      const auto ret = fileSystem.mount(card);
      if (ret != 0)
      {
        fiprintf(debugStream, "fileSystem.mount(): %d\r\n", ret);
        distortos::ThisThread::sleepFor(std::chrono::seconds{60});
      }
    }
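    // file system write speed: 10 runs of writing the same dump to a littlefs file (truncated on open)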
    for (size_t i {}; i < 10; ++i)
    {
      fiprintf(debugStream, "====================\r\n%zu b file system write, run %zu\r\n", writeSize, i + 1);
      std::unique_ptr<distortos::File> file;
      {
        int ret;
        std::tie(ret, file) = fileSystem.openFile("dump.bin", O_WRONLY | O_CREAT | O_TRUNC);
        if (ret != 0 || file == nullptr)
        {
          fiprintf(debugStream, "fileSystem.openFile(): %d\r\n", ret);
          distortos::ThisThread::sleepFor(std::chrono::seconds{60});
          continue;
        }

      }
      const auto start = distortos::TickClock::now();
      {
        const auto ret = file->write(reinterpret_cast<const void*>(0x8000000), writeSize);
        if (ret.first != 0 || ret.second != writeSize)
        {
          fiprintf(debugStream, "file->write(): %d, %zu\r\n", ret.first, ret.second);
          distortos::ThisThread::sleepFor(std::chrono::seconds{60});
          continue;
        }
      }
      const auto end = distortos::TickClock::now();
      const auto elapsed = end - start;
      fiprintf(debugStream, "elapsed: %lld ms\r\n", std::chrono::milliseconds{elapsed}.count());
    }
    {
      const auto ret = fileSystem.unmount();
      if (ret != 0)
      {
        fiprintf(debugStream, "fileSystem.unmount(): %d\r\n", ret);
        distortos::ThisThread::sleepFor(std::chrono::seconds{60});
      }
    }
  }

  return 0;
}
@geky
Member

geky commented Jul 16, 2018

Great analysis! Definitely several interesting observations here.

I had to double check, but there is in fact a quadratic operation in writing files that takes over as the file grows. It's not an operation related to the file structure, but the block allocator, which, due to COW files, gets called on every block write, even if we're modifying a file in place.

The block allocator by itself is O(n), but the n here is the size of the filesystem, including the file we're writing to. So as our file grows, the cost of allocating a block grows linearly with it, indirectly resulting in a total cost of O(n^2) for a file write.

However littlefs has a lot of tricks in place to try to keep this from becoming completely terrible!


Ok, so I ended up with a lot of details below. Sorry about the wall of text. It's a relatively subtle problem, so hopefully it helps to know how everything works.

  • So how does the block allocator work?

    At its core, littlefs's block allocator simply iterates over the entire filesystem to find unused blocks. The metadata blocks are set up to make this iteration as cheap as possible, but it can't be ignored that this is an O(size of the filesystem) operation.

    To keep this from being terrible, littlefs batches many block allocations at once, storing the blocks it finds in a compact bitmap. This is what I've called the lookahead buffer (I'm open to other names, but it'd have to be a really good name to break backwards compatibility).

    So with a lookahead of 512, the first allocation will scan the filesystem to see which of the first 512 blocks are in use. After all 512 blocks are exhausted, the allocator will move to the next 512 blocks, and so on. It's only a constant reduction, and doesn't change the asymptotic complexity, but the theory is that bitmaps are very small and provide a very large constant reduction.

    This gives you a knob you can turn to trade off speed for RAM. The idea is that small devices can afford to be slow, and fast devices can afford the RAM. This isn't always the case, but hopefully it's a reasonable compromise. (A rough conceptual sketch of this window-and-bitmap scheme follows this list.)

  • Why not a traditional allocator?

    Most traditional filesystems store used blocks in some sort of bitmap on the disk. (Logging filesystems are a neat exception because they're logs). These are relatively fast, but introduce several problems in the context of littlefs:

    1. Bitmaps need to be stored somewhere. littlefs distributes its mutable metadata across the directory blocks, but the bitmap needs to be centralized and efficient to access. Additionally, the bitmap scales linearly with the size of the disk, which makes storing copies in the directory blocks problematic.

    2. Bitmaps need to be updated every write. If they aren't stored in the directory blocks, updating the bitmap would introduce additional wear.

    3. Bitmaps add complexity to update logic. This isn't just me being lazy, but actually a concern in terms of code size. Especially when considering power-resilience. By introducing a separate disk structure to track free blocks, we create another point of failure where a power-loss or filesystem error could cause the filesystem to become out of sync with itself.

    Initially, I tried to include a distributed free structure in littlefs; however, it ended up being too problematic. And being able to just "drop blocks on the floor" when they're no longer needed greatly simplified a lot of the code paths.
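
For readers trying to picture the scheme above, here is a rough conceptual sketch of a window-plus-bitmap allocator. This is not the actual littlefs source - the names and structure are made up for illustration, and the filesystem traversal that would call mark() for every in-use block is left to the caller.

#include <cstdint>
#include <cstring>

// Conceptual sketch only: a fixed window of `size` blocks starting at `off`,
// one bit per block, bit set = block in use. When the window is exhausted,
// the caller advances `off` and re-traverses the whole filesystem, calling
// mark() for every block found in use - that traversal is the
// O(size of the filesystem) cost described above.
struct Lookahead
{
  static constexpr std::uint32_t size {512};  // window size in blocks ("lookahead")
  std::uint32_t off;                          // first block covered by the window
  std::uint32_t bitmap[size / 32];            // one bit per block -> 64 bytes of RAM

  void reset(const std::uint32_t newOff)
  {
    off = newOff;
    std::memset(bitmap, 0, sizeof(bitmap));
  }

  void mark(const std::uint32_t block)  // called for each in-use block during the scan
  {
    const auto i = block - off;  // wrap-around handling omitted for brevity
    if (i < size)
      bitmap[i / 32] |= (1u << (i % 32));
  }

  bool allocate(std::uint32_t& block)  // hand out the next free block in the window
  {
    for (std::uint32_t i {}; i < size; ++i)
      if ((bitmap[i / 32] & (1u << (i % 32))) == 0)
      {
        bitmap[i / 32] |= (1u << (i % 32));
        block = off + i;
        return true;  // found a free block inside the window
      }
    return false;  // window exhausted - advance the window and rescan the filesystem
  }
};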

So, long story short, it's a big tradeoff of runtime for code size/complexity.

One nice thing about this approach is that because the allocator is only stored in RAM, it's very easy to modify the allocator without worrying about compatibility. So it may be possible to introduce a more complex/RAM intensive/fast allocator in a version of littlefs designed for more powerful systems.


So what can you do to improve the speed?

  1. You can try increasing the erase size. The tradeoff here is storage for speed. Fortunately SD cards have a lot of space! Larger erase sizes mean fewer allocations, but more wasted space for small files. An 8KB erase size should give you the ~16x speed improvement. (Most FAT filesystems use a cluster size of ~4KB for similar reasons.)

    Note: Larger erase sizes != more RAM. The buffers are only based on the prog/read sizes, which you should keep at 512 bytes.

  2. I think you noticed this, but you can try increasing the lookahead size. The tradeoff here is RAM for speed. A lookahead of 8192 (1KB of RAM) should give you the ~16x speed improvement.
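
For concreteness, here is roughly what those two knobs look like in littlefs's own configuration (field names taken from the littlefs v1 lfs_config; the distortos wrapper may expose them differently, and makeConfig() is just a hypothetical helper, so treat this as a sketch rather than a drop-in change):

#include "lfs.h"

// Sketch of the relevant lfs_config fields only - callbacks and buffers omitted.
lfs_config makeConfig()
{
  lfs_config cfg {};
  cfg.read_size   = 512;      // keep at the card's 512 B sector size
  cfg.prog_size   = 512;      // RAM buffers scale with read/prog sizes, not block_size
  cfg.block_size  = 8192;     // was 512; fewer, larger blocks -> far fewer allocations
  cfg.block_count = 4194304;  // 32 GB card / 8 KB blocks
  cfg.lookahead   = 8192;     // was 512; counted in blocks, so 8192 / 8 = 1 KB of RAM
  return cfg;
}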

Unfortunately, these are only constant improvements, but that's what we get for dealing with a quadratic function.

It'd be interesting to see if there is a way to fix this cost specifically for writing to a file. I wonder if we could store the bounds of all blocks in a file to keep the allocator from checking the file unless it needs to. Hmmm. May be worth investigating, though I'm not sure whether fragmentation then becomes a concern.


I should also note that SD cards are not littlefs's strong suit. On most forms of flash, writing is an order of magnitude more expensive than reading, but this isn't true of SD cards and eMMC. Additionally, the relatively large read size makes hopping around the file system during an allocator scan a much more expensive operation than on devices with smaller read sizes. The quadratic growth is only a quadratic growth of reads, which isn't great, but better than a quadratic growth of writes.

I need to run, but there's still some interesting points in your second observation around different writes costs that you may start being able to piece together. I'll update with another comment about that in a bit.

@geky
Member

geky commented Jul 16, 2018

The comparison of successive writes at different sizes is interesting and something I haven't seen before. I'm still trying to piece together exactly what's going on.

As you guessed the lookahead would explain a lot of the variations. Most notably the 128KB files are interesting as they alternate between files with a scan and files without (128KB = 256 blocks = 1/2 lookahead). As soon as that grows to 256KB files, the variation is gone as each file needs to do a scan.

The very small files only need to scan the lookahead once for the entire lifetime of the mount, so that would explain the initial cost, whereas the larger files need to scan multiple times for every file, so the initial scan (when the filesystem is empty) is actually relatively cheap.

Although I have no idea about the second spike for the 2MB file. (It's far too small to wrap around on your 32GB SD card.) I'm wondering if it's actually caused by the FTL logic inside the SD card. Unlike the raw writes, the littlefs writes are moving across the entire range of the SD card. It's possible that after a certain range of addresses the SD card slows down or needs to allocate/erase new blocks. I'm not entirely sure. I think I may try to build a model of the theoretical performance of these operations and see how they compare, though it may be a bit before I can put this together.

@FreddieChopin
Contributor Author

Thanks for your explanations! We'll see how well my application and littlefs get along, but I hope that they will work together smoothly (; Even though SD cards may not be a perfect fit for littlefs, I hope they are not a bad fit (; After all, these cards are huge, cheap and pretty fast. I think that once you need more than several hundred MB (and when the production volume is rather small), SD cards are the only thing left on the market with a reasonable price and good availability.

As for ideas to amortize the cost of the allocator: with a big chip and an RTOS you could dedicate a low-priority thread to some background work related to the file system. Maybe it would be possible to expose a function, which the user could call repeatedly when there's nothing else to do, and which could perform some of the work that would otherwise be done when the next write is executed? Something like a background defrag/file-system scan? Is that possible and worthwhile? From my perspective I guess it could be nice, as it seems that when there's a big file written on the file system, using small files (I've implemented a counter which gets incremented all the time) gets slow too - once in a few dozen/hundred writes there's a huge delay, which I assume is the allocator scan - and the same delay can be observed when actually mounting the file system.

My use case is obviously not dumping several MB of raw data to files instantly, but generally I expect the application to also have huge and long files with logs and recorded measurements. These would be updated incrementally (for example adding several hundred bytes every minute), and I'm wondering whether in such a scenario (appending a "small" piece to the end of a "large" file) I should expect writes and seeks to get slower as the file size increases. I guess I have to check that too (;

@FreddieChopin
Contributor Author

Or here's another idea to mitigate the problem that once in a few hundred writes the allocation scan can take a really long time, which makes your "write a few bytes" operation take much longer than expected (for example a minute instead of a fraction of a second). Maybe whenever one block from the allocator is used, the allocator would move on to find one new block? So instead of trying to find a few thousand blocks every few thousand writes, it would find just one with each write? Or would this basically be using a lookahead value of 1?

Combining that with my idea for exposing a public function which could be run from a low priority thread or when there's nothing better to do - this function with each call would try to find one more free block.

I'm not sure whether these ideas are possible (my understanding of file system internals is very low (; ), but I guess they would be very good additions, as right now the write time is far from deterministic. I also see that the performance decreases over time - my counter could be updated 20x per second when it was around 10000; after ~20000 writes it was updated only 3 times per second. After a reset the performance is back to ~20x per second.

@geky
Member

geky commented Jul 16, 2018

Even though SD cards may not be a perfect fit for littlefs, I hope they are not a bad fit (; After all, these cards are huge, cheap and pretty fast. I think that once you need more than several hundred MB (and when the production volume is rather small), SD cards are the only thing left on the market with a reasonable price and good availability.

Ah, you're right, SD/eMMC should be a priority for littlefs (though second to raw NOR/NAND flash). At least for now they're the main form of storage where the scalability becomes a real concern.

As for ideas to amortize the cost of the allocator: with a big chip and an RTOS you could dedicate a low-priority thread to some background work related to the file system. Maybe it would be possible to expose a function, which the user could call repeatedly when there's nothing else to do, and which could perform some of the work that would otherwise be done when the next write is executed?

That's not a bad idea at all. The only downside I can think of is that right now littlefs is RTOS agnostic, so it would probably need to be synchronized externally. And also the progress only affects RAM, so it wouldn't be useful for speeding up mount-to-write times.

From my perspective I guess it could be nice, as it seems that when there's a big file written on the file system, using small files (I've implemented a counter which gets incremented all the time) gets slow too - once in a few dozen/hundred writes, there's a huge delay which I assume is the allocator scan

Yep, unfortunately the cost of the allocator is spread over all files, including the small ones.

... the same delay can be observed when actually mounting the file system.

Ah! This is actually a slightly different operation, the deorphan step, which I have a resolution for and am currently working on.

My use case is obviously not dumping several MB of raw data to files instantly, but generally I expect the application to also have huge and long files with logs and recorded measurements. These would be updated incrementally (for example adding several hundred bytes every minute), and I'm wondering whether in such a scenario (appending a "small" piece to the end of a "large" file) I should expect writes and seeks to get slower as the file size increases. I guess I have to check that too (;

Ah! actually you should find that the seek and append are very efficient : )

file seek = O(log n)
file append = O(1)

It's just the allocation that is costly...

@geky
Member

geky commented Jul 16, 2018

Or here's another idea to mitigate the problem that once in a few hundred writes the allocation scan can take a really long time, which makes your "write a few bytes" operation take much longer than expected (for example a minute instead of a fraction of a second). Maybe whenever one block from the allocator is used, the allocator would move on to find one new block? So instead of trying to find a few thousand blocks every few thousand writes, it would find just one with each write? Or would this basically be using a lookahead value of 1?

Yeah it'd be a lookahead of size 1.

The main trick here is that scanning for 1 block is the same cost as scanning for any number of blocks - you just need the RAM to store them. If you had 8MB to spare (enough lookahead for 32GB at 512B blocks), you could scan once and file writes would be just the base O(1).
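
A quick sanity check of that 8 MB figure (treating 32 GB as 32 * 2^30 bytes, one lookahead bit per block):

#include <cstdint>

// Back-of-the-envelope arithmetic only; nothing littlefs-specific beyond the
// "one bit per block" lookahead representation described above.
constexpr std::uint64_t cardBytes {32ull * 1024 * 1024 * 1024};  // 32 GB card
constexpr std::uint64_t blockCount {cardBytes / 512};            // 67,108,864 blocks
constexpr std::uint64_t bitmapBytes {blockCount / 8};            // 8,388,608 B = 8 MB
static_assert(bitmapBytes == 8 * 1024 * 1024, "1 bit per block -> 8 MB of lookahead");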

Combining that with my idea for exposing a public function which could be run from a low priority thread or when there's nothing better to do - this function with each call would try to find one more free block.

This is probably the best idea for an "lfs_fs_lookahead" function. That is, try to find as many blocks as we have RAM for if we haven't already. Though the weird thing here is that rescanning free blocks we have already scanned would be free anyways.

@geky
Member

geky commented Jul 17, 2018

I also see that the performance decreases over time - my counter could be updated 20x per second when it was around 10000; after ~20000 writes it was updated only 3 times per second. After a reset the performance is back to ~20x per second.

Ok, that's a weird one. I suspect it's actually the SD card's FTL being problematic here. It probably ran out of erased blocks and needed to start allocating more (FTLs are actually as complex as filesystems). Though littlefs probably isn't helping by writing sequentially, which isn't common for traditional file systems.

distortos::devices::SpiSdMmcCard card {distortos::board::spi3Master, slaveSelect, 5000000};

Also, you may be able to speed this up quite a bit by querying the card - I think they usually support around 25MHz. May be worth a try.

You can try increasing the erase size. The tradeoff here is storage for speed. Fortunately SD cards have a lot of space! Larger erase sizes mean fewer allocations, but more wasted space for small files. An 8KB erase size should give you the ~16x speed improvement. (Most FAT filesystems use a cluster size of ~4KB for similar reasons.)

Note: Larger erase sizes != more RAM. The buffers are only based on the prog/read sizes, which you should keep at 512 bytes.

Do try this if you can! I suspect bumping up the erase size is the best answer to the speed limitation for now.

@FreddieChopin
Contributor Author

FreddieChopin commented Jul 17, 2018

That's not a bad idea at all. The only downside I can think of is that right now littlefs is RTOS agnostic, so it would probably need to be synchronized externally.

My general idea was for this function to be called only optionally. If you don't call it, you get exactly what littlefs is doing right now. If you call it, some operations get a speed-up. Unless that complicates the code too much?

Moreover, the operation of this function should be "short", as it will require locking the file system. If it did something very long while holding the lock, the situation would not be that much improved anyway. That's why I'm talking just about "one call - one new block" instead of "... - whole new scan".

And also the progress only affects RAM, so it wouldn't be useful for speeding up mount-to-write times.

This is something I don't fully understand, unless you are saying that having such a function will not reduce the time the mount takes. That is of course expected; however, I think that these tasks could be offloaded to this function too (; Unless someone is really doing "mount-write-unmount" all the time, but in that case it would be a problem with the application <:

This is probably the best idea for an "lfs_fs_lookahead" function. That is, try to find as many blocks as we have RAM for if we haven't already. Though the weird thing here is that rescanning free blocks we have already scanned would be free anyways.

The important thing here is what I wrote in the second paragraph. This function should be kept as short (in time) as possible; otherwise it wouldn't really help that much. The general assumption is that, to really help, the user would call the function repeatedly, over and over again - for example several hundred times.

Or you could actually expose an argument in this function - either a bool (full-and-long scan or as-short-as-possible scan) or just a number (the number of blocks to "update" in this call, or anything else which would allow selecting the length of the operation) - so that an RTOS concerned with locking would choose to keep this function short and call it 100x per second, while a bare-metal application could choose to do a full, long rescan once in a while.

If you had 8MB to spare (enough lookahead for 32GB at 512B blocks), you could scan once and file writes would be just the base O(1).

If I had 8 MB to spare, life would be much nicer (; It's just an STM32F4 <:

Ok, that's a weird one. I suspect it's actually the SD card's FTL being problematic here. It probably ran out of erased blocks and needed to start allocating more (FTLs are actually as complex as filesystems). Though littlefs probably isn't helping by writing sequentially, which isn't common for traditional file systems.

If that were the case, I suspect that resetting the device wouldn't change anything (resetting without power-cycling the SD card, though it would of course be reinitialized again). I'll investigate that a bit more, as this seems interesting too.

Also, you may be able to speed this up quite a bit by querying the card - I think they usually support around 25MHz. May be worth a try.

Yes, that would of course be a good solution, but the SPI driver is interrupt-based, which turned out to be a very bad idea for an RTOS. With a high frequency and some other activity going on (other interrupts especially), it easily "chokes" and detects data overruns /; So either there's some bug in the driver, or using SPI that way in an RTOS is just stupid. Most likely the latter, so I have to rework that to use DMA some day... Heh, so much to do, so little time (;

Do try this if you can! I suspect bumping up the erase size is the best answer to the speed limitation for now.

Currently my biggest concern is not the speed but the variation in speed - usually writing a small file takes a fraction of a second, but sometimes it may take several dozen seconds, which may make the user of the device a bit nervous...

@FreddieChopin
Contributor Author

FreddieChopin commented Aug 1, 2018

Any chance we could have the idea presented above (the lfs_fs_lookahead() function) implemented in the near future? I'm asking because unfortunately without it littlefs is unusable for the application I'm working on. This application is a monitoring/control device with a function to log data periodically. The incremental amount of data written is pretty small, as the data is coming from Modbus, which is itself pretty slow. I wouldn't expect anything more than maybe 1 kB per second in the most extreme cases. However, even with such a low speed, it takes only a few days until the amount of data in the filesystem reaches dozens or hundreds of MB. And in this case the device becomes completely unresponsive. For example, now I have uploaded 120 MB of data to the SD card (on PC, via lfs-fuse). The mount was successful and instantaneous. The first write operation which I attempted (a write of less than 100 bytes) is now taking 40 minutes already and is still in progress. Extrapolating the times I saw earlier (with the 2 MB flash dump mentioned above), I expect this operation to take something like 2-3 hours. This is not something which can be realistically handled in an application that uses a watchdog...

Actually I think that the whole-system scan which is done after the first write is a wrong idea and has to be removed (or maybe made optional). Even with something like an lfs_fs_lookahead() function executed in some background thread, it would not be possible to write anything to the file system for a pretty long time. Even if I ramped up the transfers dramatically and increased the lookahead/buffers to some crazy levels, the first operation would still be extremely slow. We're talking about storage which can easily have a few dozen GB of capacity, while the expected read speed is probably no more than ~1-3 MB/s (the theoretical limit for SPI at 25 MHz would be something like 3.1 MB/s). So even with perfectly fast SPI transfers, scanning 120 MB of data would take at least 40 seconds...

Or maybe this is actually related to the "deorphan step" you mentioned earlier?

Ah! This is actually a slightly different operation, the deorphan step, which I have a resolution for and am currently working on.

Please let me know whether you'd prefer to continue discussing this issue here or maybe you'd prefer to have it as a new issue (this one is pretty long and "multithreaded"). Thanks in advance!

@geky
Member

geky commented Aug 2, 2018

Sorry @FreddieChopin, currently my priority is to get the work around metadata logging, custom attributes, and inline files done (#23, more info, dir-log branch), since this will have a big impact on the rest of the filesystem.

I have uploaded 120 MB of data to the SD card (on PC, via lfs-fuse). The mount was successful and instantaneous. The first write operation which I attempted (a write of less than 100 bytes) is now taking 40 minutes already and is still in progress.

Ouch!

Have you tried setting the block_size up to some ridiculously high value? The only cost is storage space, and SD cards have a lot of storage space.

Additionally, with the dir-log work, relatively small files (up to 4KB) can be stored inside the directory blocks. So in the near future large block sizes won't be that bad for small files.

Actually I think that the whole-system scan which is done after the first write is a wrong idea and has to be removed (or maybe made optional). Even with something like an lfs_fs_lookahead() function executed in some background thread, it would not be possible to write anything to the file system for a pretty long time. Even if I ramped up the transfers dramatically and increased the lookahead/buffers to some crazy levels, the first operation would still be extremely slow.

Yeah, I suspect running the lookahead in a background thread won't be that useful for embedded systems. There's still the first-time cost.

The reason for the whole-system-scan is that we need to keep track of free blocks. By not storing free blocks in the filesystem we can decrease the code size of the implementation significantly. Though I'm open to different ideas if there's another solution that works well.

We're talking about storage which can easily have a few dozen GB of capacity, while the expected read speed is probably no more than ~1-3 MB/s (the theoretical limit for SPI at 25 MHz would be something like 3.1 MB/s). So even with perfectly fast SPI transfers, scanning 120 MB of data would take at least 40 seconds...

So, we actually only need to find each block address. The issue is that because we're using a linked-list, if our read_size is large we end up reading the nearby data. This is why increasing the block_size (but not the read_size) will reduce the scan cost.

One long-term option may be to add support for "indirect ctz-lists" (thought up here), where each block is actually just an array of pointers to other blocks. This would make the most use out of large reads, though would require a bit of work to see if the data structure could reuse the same logic as the normal ctz-lists.

Or maybe this is actually related to the "deorphan step" you mentioned earlier?

Ah! This is actually a slightly different operation, the deorphan step, which I have a resolution for and am currently working on.

So the deorphan step is this big O(n^2) scan for orphans, where n is the number of metadata blocks. It doesn't need to look at the file list at all, so large files don't matter, but it does become an issue if you have a large number of files.

It's actually already been removed in the dir-log branch : )

@FreddieChopin
Contributor Author

Sorry @FreddieChopin, currently my priority is to get the work around metadata logging, custom attributes, and inline files done (#23, more info, dir-log branch), since this will have a big impact on the rest of the filesystem.

Would it be possible to make the issue of very-long-first-write-after-mount a second priority then? (;

Have you tried setting the block_size up to some ridiculously high value? The only cost is storage space, and SD cards have a lot of storage space.

I'll try that tomorrow, but I suspect that this will only decrease the time proportionally. We'll see...

Additionally, with the dir-log work, relatively small files (up to 4KB) can be stored inside the directory blocks. So in the near future large block sizes won't be that bad for small files.

Would that be "no more than 1 file, no more than ~4 kB, per 1 directory" or maybe "any number of files as long as combined size is less than ~4 kB, per 1 directory"?

The reason for the whole-system-scan is that we need to keep track of free blocks. By not storing free blocks in the filesystem we can decrease the code size of the implementation significantly. Though I'm open to different ideas if there's another solution that works well.

I'm in no position to suggest alternatives, as my knowledge of file systems is pretty low. I'm just trying to understand what is going on and whether it's a fundamental flaw which cannot be removed without redesigning the whole thing, or whether it can actually be fixed in the future. As with everything else, it always comes down to a balance of different aspects (like complexity, speed, RAM usage, wear, features, ...). Don't get me wrong, I really like the overall feature set of littlefs, and power resilience is a very important feature. However - as it is now - the solution just doesn't scale well... People will use this project with SD cards (I suspect much more often than with raw memory chips) and will put a lot of data in the file system, which is when they will hit the same issue that I'm facing now. The first operation may take a bit longer than usual, but in my opinion anything above ~10 seconds (assuming that this first operation is not "dump 100 MB of data") is unacceptable and cannot be reliably used in a real-time environment );

If you ask me, the file system may use 2x as much flash and 2x as much RAM as long as the whole-system scan can be omitted. And from what you wrote, the only option to actually omit it would be to store the info about free blocks in the storage. Whether you store this as a tree or a list or something in-between doesn't really matter, as long as this info is available right away after mount.

All attempts to just make this faster (instead of omitting it) only postpone the moment when the problem starts to be noticeable and starts causing trouble. Assuming the theoretical limit of 3 MB/s read speed, it takes only ~60000 used blocks to reach 10 seconds (assuming a read block size of 512 B). I'm not trying to complain just for the sake of complaining, just looking for the perfect reason that would convince you (;

@geky
Member

geky commented Aug 2, 2018

I'll try that tomorrow, but I suspect that this will only decrease the time proportionally. We'll see...

It is more a short-term solution than anything, though it may work well enough for your use case.

Would that be "no more than 1 file, no more than ~4 kB, per 1 directory" or maybe "any number of files as long as combined size is less than ~4 kB, per 1 directory"?

Ah, 4KB per file and 1024 files per directory block, unlimited directory blocks in a directory.

I'm not trying to complain just for the sake of complaining, just looking for the perfect reason that would convince you (;

I didn't think you were complaining - I owe you sincere thanks for the constructive criticism; it is very valuable : )

Just wanted to post the quick responses, I'll be able to respond more in a bit.

(One note: from my perspective I've actually seen more consumers of small SPI flash chips than of SD cards. Though this doesn't change the fact that scalability is still a concern. Storage never seems to get smaller.)

@geky
Member

geky commented Aug 6, 2018

The reason for the whole-system-scan is that we need to keep track of free blocks. By not storing free blocks in the filesystem we can decrease the code size of the implementation significantly.

I was wrong in saying this: the main issue isn't the code size cost, but the fact that littlefs is built entirely around the idea that it doesn't need to track free blocks. Changing this will require a rewrite of most of the code and will be a significant effort. Before shifting focus I want to make sure the existing work is in (#85).

Using small files (I've implemented a counter which gets incremented all the time) gets slow too

One thing I just realized, actually: the lookahead scan is only incurred when a block is allocated. With #85, if your files fit in inline files, you can avoid this cost. Again, this is just a work-around if anything.

And from what you wrote, the only option to actually omit [an O(fs^2) operation] would be to store the info about free blocks in the storage. Whether you store this as a tree or a list or something in-between doesn't really matter, as long as this info is available right away after mount.

Excellent observation, I believe you're right about this.


So idea! I've been thinking about this issue and think I have a good solution:

  1. Store a list of free blocks. As you mentioned this is probably the only way to beat the O(fs^2) cost.

    I was considering a bitmap, but I think the best solution is to just store each block as a 32-bit address in a file. This allows us to deallocate (append to list) in O(1) and reuse all of the existing logic for building files.

    Rather than storing this list in the superblock, we can store it in the distributed globals being introduced in v2 ("Metadata logging, custom attributes, inline files, and a major version bump", #85), which means any metadata block update can atomically update the free list.

  2. Downside: we have to store O(disk) addresses, and looking up an address at allocation time has to seek to the beginning of a file.

    Right now this seek cost is O(log n). For an 8 GB SD card, this would require 17 reads (log2(((8GB / 512) * 4) / (512-8))) if every block were free. (This arithmetic is written out in the sketch after this list.) Counterintuitively, the allocator would actually speed up as more blocks are in use.

  3. Big thing to note: this solution can live next to the existing allocator, and the existing allocator can be used to build the free list if it's missing.

    This means the free-list allocator can be introduced in a backwards compatible manner, and that the free-list allocator can be omitted on small ROM devices if there is a significant code cost.

    And more interestingly, the two allocators may complement each other. The whole-system-scan allocator works fine for small filesystems (note that all of this is relative), but becomes very costly when the filesystem is large. Using a free-list becomes very cheap when a filesystem is full, but an empty filesystem has to maintain a large free-list.

    I'm thinking there may be a way to keep the advantages of both by switching from the whole-system scan to the free-list when the filesystem reaches a certain size. Though the switch may be a bit messy. Maybe once a certain size is reached, littlefs could start building the free-list from deallocations, though it would take a while before the free-list contains all unused blocks and littlefs can start trusting it.
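
Here is the arithmetic from point 2 written out (taking 8 GB as 8 * 10^9 bytes and 512 B blocks with 8 bytes of CTZ pointers, exactly as in the formula above - just a worked example, not littlefs code):

#include <cmath>
#include <cstdio>

int main()
{
  const double blockCount {8e9 / 512};              // ~15.6 million block addresses
  const double listBytes {blockCount * 4};          // 4 bytes (32 bits) per address
  const double listBlocks {listBytes / (512 - 8)};  // ~124,000 free-list blocks
  const double seekReads {std::log2(listBlocks)};   // ~16.9, i.e. ~17 reads to seek
  std::printf("free-list blocks: %.0f, seek depth: ~%.0f reads\n", listBlocks, seekReads);
  return 0;
}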


Anyways just a thought. There are a lot of details not fleshed out, and we'll need a prototype before we can really see how it behaves.

I'm not going to lie, it's probably going to take at minimum a month or two (or three? I'm not good at predicting time) before this can get in. I want to make sure this is the right design decision, because right now the design is in a relatively safe position. By not having a free-list, we have the opportunity to add one in a backwards-compatible manner. But once we add one, we're stuck with it unless we want to break disk compatibility : )

@FreddieChopin
Contributor Author

If I understood you correctly (I'm not 100% certain about a few things), you would like a list where each node is exactly "1 block". This has a potential problem: a format operation of an X GB storage would actually need to write X GB of data, so it would be pretty slow. Maybe it would be possible to store list nodes with a variable size? So initially you would have a free list with 1 element which would cover the whole storage. If you want to allocate a block (or a few), you just cut it from the beginning of this huge element. After deallocation you could (but don't actually have to) merge adjacent free blocks into one. This would basically be a 1:1 copy of how malloc() works in RAM. Without the merge step this shouldn't be much more complicated, and the format operation would be extremely trivial. The merge step would probably only work well if you kept the list sorted by address, but this would make things both slower and more complicated (or maybe not?), for no obvious gain (is there a need to allocate more than 1 block at a time?).
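
To make the suggestion above concrete, here is a hypothetical sketch of such a variable-sized ("extent") free list - nothing littlefs actually implements, and the names are made up; it just illustrates the malloc-like idea:

#include <cstddef>
#include <cstdint>

// Hypothetical extent-based free list: each entry covers a contiguous run of
// blocks, so a freshly formatted device needs only one entry spanning everything.
struct FreeExtent
{
  std::uint32_t firstBlock;  // first free block of this run
  std::uint32_t blockCount;  // length of the run, in blocks
};

// Allocate a single block by cutting it from the front of the first non-empty
// extent; merging adjacent extents on deallocation is left out, as noted above.
bool allocateBlock(FreeExtent* const extents, const std::size_t extentCount, std::uint32_t& block)
{
  for (std::size_t i {}; i < extentCount; ++i)
    if (extents[i].blockCount != 0)
    {
      block = extents[i].firstBlock++;
      --extents[i].blockCount;
      return true;
    }
  return false;  // no free blocks left anywhere
}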

And more interestingly, the two allocators may complement each other. The whole-system-scan allocator works fine for small filesystems (note that all of this is relative), but becomes very costly when the filesystem is large. Using a free-list becomes very cheap when a filesystem is full, but an empty filesystem has to maintain a large free-list.

I'm thinking there may be a way to keep the advantages of both by switching from the whole-system scan to the free-list when the filesystem reaches a certain size. Though the switch may be a bit messy. Maybe once a certain size is reached, littlefs could start building the free-list from deallocations, though it would take a while before the free-list contains all unused blocks and littlefs can start trusting it.

I wouldn't go that way personally. This way you introduce some nondeterministic behaviour and need to have 2 code paths running simultaneously. With slow storage the filesystem size at which things get slow is not that significant anyway. With a good design the stored free-list can be robust (it will be faster no matter how small the storage or the filesystem is). With only one possible allocator the code will be smaller & simpler, you'd have only 1 combination to test and a few tweakable parameters fewer. If the list can contain variable-sized entries, the initial size of the list is extremely small.

Additionally, the stored free-list would improve wear levelling too. Currently, after each mount the writes start from the free block with the lowest address. On huge storage (or with an unwise application which unmounts after each write) this means that the blocks near the end get no writes, while the blocks at the beginning get almost all of them (excluding the case where files are only appended, never modified in place). With a FIFO list the writes would be levelled in all cases.

And here is something else I don't fully understand:

I was considering a bitmap, but I think the best solution is to just store each block as a 32-bit address in a file.

What do you mean by this? Each existing file would have an address of the head of the list?

I'm not going to lie, it's probably going to take at minimum a month or two (or three? I'm not good at predicting time) before this can get in.

Such a time-frame is perfectly fine, so let me thank you in advance for considering this change! I can obviously help with testing (my 32 GB card is a willing test subject [; ), and maybe I can help with some decisions or ideas too - if needed.

I want to make sure this is the right design decision, because right now the design is in a relatively safe position.

I'm sure this direction is correct and this will make littlefs much more scalable! Implementation details will surely need some work and thought. We can for sure help you out with thinking and reviewing the ideas. I've read the description of the littlefs design over the weekend, so I should not write stupid things anymore (or so often [; ).

BTW - about the cost of seeking to the beginning of a file: wouldn't it be possible to just keep a pointer to the head of the file in the metadata? I know you cannot go anywhere from the beginning anyway (because files are stored as a unidirectional "reversed" list), but if you want to store some useful metadata at the beginning then this could be beneficial. Or maybe just store this useful metadata in the metadata blocks, not in the file's data blocks?

@dl-dl

dl-dl commented Aug 9, 2018

I would suggest considering the implementation of some fixed-size write-through cache on top of the block device. Perhaps different cache algorithms/sizes for different use cases.
