
Feature Request: Extra programming buffer for metadata or waiting to commit the new file until closed? #942

Open · kyrreaa opened this issue Feb 12, 2024 · 7 comments


kyrreaa commented Feb 12, 2024

I've been looking at how lfs currently creates a new file, and it appears it could use more caching when RAM is not at a premium.
Committing a file id, then a record of the filename accompanied by an inline tag with length 0, just to remember the name and the fact that the file is 0 bytes, is in most cases not very useful.
It could be useful to wait with committing the file until a write determines whether an inline data attribute is needed or a CTZ list may be used. One could potentially even wait until the file is closed: there is not much difference between metadata saying a file exists at 0 length and no metadata at all, when a whole CTZ-linked array of blocks may already exist on disk without any reference to it. Either way, the filesystem would still be in a valid state.
If a full write_size buffer of memory were provided for the metadata, it could be written out on sync() or when it fills up, allowing the pcache to work with file data only. This could also reduce the need for garbage collection / compacting.
It would be nice to give the user control over when this metadata flush happens: on file/dir close, when the buffer fills up, or on manual sync, depending on the situation and the user's needs.
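
For illustration, here is roughly where metadata commits happen today with the standard littlefs API, assuming an already-mounted lfs_t in lfs. The comments describe the deferral this proposal suggests, not littlefs's current behavior:

  uint8_t buf[256] = {0};
  lfs_file_t file;

  // today: LFS_O_CREAT commits a file id and a 0-length inline
  // entry to the metadata log immediately; under this proposal,
  // that commit would be deferred until the first write/sync/close
  lfs_file_open(&lfs, &file, "log.bin", LFS_O_WRONLY | LFS_O_CREAT);

  // the first write decides inline data attribute vs CTZ list,
  // so it is the earliest point a metadata commit is truly needed
  lfs_file_write(&lfs, &file, buf, sizeof(buf));

  // alternatively, batch metadata in a write_size buffer and flush
  // it only here, on lfs_file_sync, or when the buffer fills up
  lfs_file_close(&lfs, &file);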


BenBE commented Feb 12, 2024

Even a get_file_contents/put_file_contents style API with provided buffers could be nice.

kyrreaa (Author) commented Feb 12, 2024

Well, if you already have some file content planned I see the idea, but the FS API in Zephyr, for example, does not allow for that. It does, however, allow for a sync(), and as stated, the filesystem is in a valid state whether or not the filename has been committed. But the user's intent is not reflected by a 0-length file when they may in fact have written several blocks of data and simply never got to close the file. Power loss would be nicer to handle if the initial file were either what the user wanted, or not there at all.


BenBE commented Feb 12, 2024

I think if the state upon power loss were no file at all, this would be closer to what the user wanted (assuming close to be a sync point). But on the other hand, IIRC open+create on POSIX is atomic too (and thus creates an empty file), so both views are valid.

geky (Member) commented Feb 18, 2024

This issue comes at a really interesting time, as I recently tried to implement this (delaying file creation until first sync/close).

Though my motivation was more to avoid 0-sized files, as they are a real pain point for users, rather than performance/disk utilization.

This commit may be worth a read, though I'll summarize: ba505c2

ba505c2 Implemented scratch file basics

"Scratch files" are a new file type added to solve the zero-sized
file problem. Though they have a few other uses that may be quite
valuable.

The "zero-sized file problem" is a common surprise for users, where what
seems like a simple file create+write operation:

  lfs_file_open(&lfs, &file, "hi",
          LFS_O_WRONLY | LFS_O_CREAT | LFS_O_EXCL);
  lfs_file_write(&lfs, &file, "hello!", strlen("hello!"));
  lfs_file_close(&lfs, &file);

can end up creating a zero-sized file under powerloss, breaking user
assumptions and their code.

The tricky thing is that this is actually correct behavior as defined by
POSIX. open with O_CREAT creates a file entry immediately, which is
initially zero-sized. And the fact that power can be lost between open
and close isn't really avoidable.

But this is a common enough footgun that it's probably worth deviating
from POSIX here.

But how to avoid zero-sized files exactly? First thought: Delay the file
creation until sync/close, tracking uncreated files in-device until
then. This solves the problem and avoids any intermediary state if we
lose power, but came with a number of headaches:

  1. Since we delay file creation, we don't immediately write the filename
    to disk on open. This implies we need to keep the filename allocated
    in RAM until the first sync/close call.

    The requirement to keep the filename allocated for new files until
    first sync/close could be added to open, and with the option to call
    sync immediately to save the filename (and accept the risk of
    zero-sized files), I don't think it would be that bad of an API.

    But it would still be pretty bad. Extra bad because 1. there's no
    way to warn on misuse at compile-time, 2. use-after-free bugs have a
    tendency to go unnoticed annoyingly often, 3. it's a regression from
    the previous API, and 4. who the heck reads the more-or-less same
    open documentation for every filesystem they adopt.

  2. Without an allocated mid, tracking files internally gets a lot
    harder. The best option I could think of was to keep the opened-file
    linked-list sorted by mid + (in-device) file name.

    This did not feel like a great solution and was going to add more
    code cost.

  3. Handling mdir splits containing uncreated files adds another
    headache. It complicated lfsr_mdir_estimate further, as it needs to
    decide in which mdir the uncreated files will end up, and potentially
    split on a filename that isn't even created yet.

  4. Since the number of uncreated files can be potentially unbounded, you
    can't prevent an mdir from filling up with only uncreated files. On
    disk this ends up looking like an "empty" mdir, which needs special
    handling in littlefs to reclaim after powerloss.

    Support for empty mdirs -- the orphaned mdir scan -- was already
    added earlier. We already scan each mdir to build gstate, so it
    doesn't really add much cost.

Notice that last bullet point? We already scan each mdir during mount.
Why not, instead of scanning for orphaned mdirs, scan for orphaned
files?

So this leads to the idea of "scratch files". Instead of actually
delaying file creation, fake it. Create a scratch file during open, and
on the first sync/close, convert it to a regular file. If we lose power,
scan for scratch files during mount, and remove them on first write.

Some tradeoffs:

  1. The orphan scan for scratch files is a bit more expensive than for
    mdirs on storage with large block sizes. We need to look at each file
    entry vs just each mdir, which pushes the runtime up to O(B log B) vs
    O(B).

    Though if you also consider large mtrees, the worst case is still
    O(n log n).

  2. Creating intermediate scratch files adds another commit to file
    creation.

    This is probably not a big issue for flash, but may be more of a
    concern on devices with large prog sizes.

  3. Scratch files complicate unrelated mkdir/rename/etc code a bit, since
    we need to consider what happens when the dest is a scratch file.

But the end result is simple. And simple is good. Both for
implementation headaches, and code size. Even if the on-disk state is
conceptually more complicated.

You may have noticed these scratch files are basically isomorphic to
just setting an "uncreated" flag on the file, and that's true. There may
have been a simpler route to end up with the design, but hey, as long as
it works.

As a plus, scratch files present a solution for a couple other things:

  1. A file removed while still open can become a scratch file until it's closed.

  2. Scratch files can be used as temporary files. Open a file with
    O_DESYNC and never call sync and you have yourself a temporary file.

    Maybe in the future we should add O_TMPFILE to avoid the need for
    unique filenames, but that is low priority.

Initially, my thinking was the same as @kyrreaa's: wait until the first sync/close to write out anything about the file. But as I was trying to implement this, I kept running into problems:

  1. We need to store the filename somewhere.

    If we're not writing the filename to disk in open, we need to write the filename in sync/close, which means we need the filename at that point.

    The best option I found was to just leave this up to the user. Iff open is called on a non-existent file, the user must keep the path string allocated until the first sync/close.

    I was fully prepared to adopt this API if everything else worked. But it would shoot users in the foot, which probably makes this the worst API I've ever seriously considered.

  2. By not assigning each file a metadata pair and id, tracking opened files becomes a lot harder.

    Consider how multiple open files interact:

    lfs_file_open(&lfs, &fileB, "B", LFS_O_WRONLY | LFS_O_CREAT) => 0; // feel free to shuffle these
    lfs_file_open(&lfs, &fileC, "C", LFS_O_WRONLY | LFS_O_CREAT) => 0; //
    lfs_file_open(&lfs, &fileD, "D", LFS_O_WRONLY | LFS_O_CREAT) => 0; //
    lfs_file_open(&lfs, &fileA, "A", LFS_O_WRONLY | LFS_O_CREAT) => 0; //
    
    ...
    
    lfs_file_close(&lfs, &fileA) => 0; // who gets what id?
    lfs_file_close(&lfs, &fileB) => 0; //
    lfs_file_close(&lfs, &fileC) => 0; //
    lfs_file_close(&lfs, &fileD) => 0; //

    Though there are also issues with a single open file and unrelated removes/renames/etc.

    This is solvable by maintaining, say, a linked list sorted by pair + id + filename, but it adds quite a bit of complexity (a possible node layout is sketched after this list).

  3. (Future, but relevant for ba505c2) If we add inline-files that aren't RAM bound, we may end up actually writing the inline data to disk before the filename.

    As you can probably imagine, this creates a whole bunch of extra problems:

    • When we split an mdir, we need to include unassociated inline data in our partitioning.

    • It's possible for mdirs to contain only unassociated inline data, leaving them empty if there is a powerloss. Empty mdirs can't be naturally reclaimed, so we need to scan for empty mdirs during mount.
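
For illustration only, a node in that sorted list might look something like this (purely hypothetical; not actual littlefs code):

  // hypothetical node for an opened-but-uncreated file, kept in a
  // list sorted by (pair, id, name) so tentative ids stay stable
  // as other files come and go
  struct uncreated_file {
      lfs_block_t pair[2];          // mdir pair the file would land in
      uint16_t id;                  // tentative id within that mdir
      const char *name;             // filename, kept allocated by the user
      struct uncreated_file *next;  // next entry in sorted order
  };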

I think all of these problems ended up with at least theoretical solutions. But it also feels like the design is sort of fighting itself.

And if we already would need to scan for empty mdirs during mount... Just setting an "I don't exist yet!" flag on newly created files is a whole lot simpler...


So those are my thoughts: an "I don't exist yet!" flag is a better way to solve the 0-sized file problem in littlefs's design.

(or more likely a new file type because of encoding reasons, but that's details)
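
To make that concrete, here is the commit's own example annotated with what powerloss at each step would leave behind under the scratch-file / flag scheme (the annotations are mine, not from the commit):

  lfs_file_open(&lfs, &file, "hi",
          LFS_O_WRONLY | LFS_O_CREAT | LFS_O_EXCL);
  // powerloss here: only a scratch file exists; the mount-time
  // scan reclaims it, so no 0-sized "hi" ever appears

  lfs_file_write(&lfs, &file, "hello!", strlen("hello!"));
  // powerloss here: still a scratch file, so still no 0-sized "hi"

  lfs_file_close(&lfs, &file);
  // close syncs, atomically converting scratch -> regular file;
  // only now does "hi" exist, with its full contents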

But I hadn't really considered the performance savings of not writing the filename.

Still, if the tradeoff is one prog_size write for the above complexity (and code cost), I think it's not going to be worth it. littlefs is not going to outperform a larger-scale filesystem anyways.


But on the other hand IIRC open+create on POSIX would be atomic too (and thus create an empty file), thus both views are valid.

Yeah, POSIX expects open to create a file, so I think you're right. But given the number of issues that have been opened over this, I think avoiding 0-sized files is the best option for allowing users to easily write power-safe code.

They can always call sync if they really want a 0-sized file.
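
Under that scheme, explicitly materializing an empty file would look something like this (assuming the scratch-to-regular conversion happens on the first sync):

  lfs_file_open(&lfs, &file, "empty", LFS_O_WRONLY | LFS_O_CREAT);
  lfs_file_sync(&lfs, &file);   // "empty" now exists at 0 bytes
  lfs_file_close(&lfs, &file);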

We already deviate from POSIX a bit when it comes to power-loss quality-of-life things. Truncate doesn't take effect until sync/close for example.

...

Actually, reading the POSIX O_CREAT docs (here), I think they don't actually specify what size the new file should be? We could make new files 12 bytes for fun :)

geky (Member) commented Feb 18, 2024

Even a get_file_contents/put_file_contents style API with provided buffers could be nice.

I do want to add these eventually. I was thinking just lfs_get/lfs_set, since then lfs_get/lfs_set/lfs_remove start looking a lot like a simpler key-value API, and matches lfs_getattr/lfs_setattr/lfs_removeattr quite nicely.

They're a win-win for everyone as they're a simpler API if you don't really need to manipulate file content, and if you only use get/set you can likely trim off a decent amount of code.
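
The signatures could plausibly mirror the existing attr functions, something like this (speculative sketch, not a committed API):

  // read a whole file into a buffer, returning the size read
  lfs_ssize_t lfs_get(lfs_t *lfs, const char *path,
          void *buffer, lfs_size_t size);

  // atomically replace a file's contents from a buffer
  int lfs_set(lfs_t *lfs, const char *path,
          const void *buffer, lfs_size_t size);

  // lfs_remove already exists and rounds out the key-value trio
  int lfs_remove(lfs_t *lfs, const char *path);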

All non-disk APIs are strictly low-priority for me atm, given other things.

but the api for FS in Zephyr for example does not allow for that.

Zephyr could always map sync to get/set behind the scenes if that accomplishes better behavior.

Though while I don't want to make Zephyr devs' jobs harder, littlefs really shouldn't be limited by what APIs do or don't exist elsewhere.

kyrreaa (Author) commented Feb 19, 2024

Yeah, POSIX expects open to create a file, so I think you're right. But given the number of issues that have been opened over this, I think avoiding 0-sized files is the best option for allowing users to easily write power-safe code.

I think the issues people have really stem from the same thing mine did: a loss of power, or a SW reset due to a bug in code, etc., causing the file to appear 0-length even though data may actually have been written to many blocks. Since no reference is ever written recording the file length change or the blocks allocated and filled, the average user cannot recover any data.
Could this be alleviated by activating "extra journalling" if the user wanted it? Something that, for example, updates the metadata with a reference to each block being allocated to the file, then as a block is completed updates a temporary tag for the last complete block, or some other ref. What I am thinking of is basically something that would under normal circumstances never be read, but could be used by cklfs() or similar to detect incomplete files the user may wish to recover. The blocks referenced by such metadata would be counted as used by the lookahead to prevent overwriting, and metadata of this character would be preserved over a compaction unless the file was explicitly deleted.

geky (Member) commented Feb 28, 2024

I think this discussion is starting to converge with #523 (comment).

Could this be alleviated by activating "extra journalling" if the user wanted it?

Journals are logs and littlefs's mdirs are logs, so I think this is roughly equivalent to calling lfs_file_sync periodically. That is, if you ignore performance implications, because right now littlefs needs to rewrite the last block on append due to padding alignment issues.

Note that lfs_file_sync gives the user control over what data boundaries are atomic. Not all applications can recover partially written data easily.
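
In practice that means syncing at whatever boundaries the application can recover from. A minimal sketch, with records/nrecords as placeholders for the application's own data:

  // sync at record boundaries so powerloss only ever loses the
  // record currently being written; each sync is an atomic commit
  // of the file's data and metadata, much like a journal entry
  for (int i = 0; i < nrecords; i++) {
      lfs_file_write(&lfs, &file, records[i].data, records[i].size);
      lfs_file_sync(&lfs, &file);
  }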

There's a batch of work that aims to address a number of the performance problems in littlefs. I think we need to at least wait for that to land before considering new APIs that are mostly performance-motivated.
