
Reconsider use of fdatasync() #22

Closed
hhoffstaette opened this issue Apr 30, 2014 · 16 comments

@hhoffstaette

Nocache is a great way to prevent buffer cache pollution e.g. by rsync.

Unfortunately I just found out the hard way that it also totally destroys rsync performance with small-file workloads due to calling fdatasync() (via sync_if_writable() in free_unclaimed_pages()). Since this happens for every little file (incl. temp files!) on close(), performance becomes completely unacceptable. Not only is it dog slow, it is also not helpful for drive longevity.

I have for now preloaded libeatmydata in addition to nocache, which negates this effect and restores performance; however this should not be necessary. fadvise() works just fine without prior fsync() - buffer pages are flushed out in nice batches, and still marked for reuse.

Unless there is a technical reason that I'm unaware of, please consider removing the use of fdatasync().
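
For illustration, this is roughly the per-close pattern I'm describing; a minimal sketch, not the actual nocache code (drop_cache_on_close() is a hypothetical name, the real path goes through sync_if_writable() and free_unclaimed_pages()):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>      /* posix_fadvise(), POSIX_FADV_DONTNEED */
#include <unistd.h>     /* fdatasync() */

/* Called from the close() wrapper for every written file, however small. */
static void drop_cache_on_close(int fd)
{
    fdatasync(fd);                                /* one synchronous flush per file */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* then drop the cached pages */
}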

@Feh
Owner

Feh commented Apr 30, 2014

Hi,

fadvise() works just fine without prior fsync() - buffer pages are flushed out in nice batches, and still marked for reuse.

The posix_fadvise() man page says regarding POSIX_FADV_DONTNEED semantics:

   Pages that have not yet been written out will be unaffected, so if the
   application wishes to guarantee that pages will be released, it should
   call fsync(2) or fdatasync(2) first.

My reading of this is that with a slow disk you would potentially get cache thrashing (open, write and close before the disk even seeks to the position)… right?

Maybe we should provide an option/env var to make the fdatasync() call optional, and set it to “make the call” by default to preserve current behaviour?

Julius
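
For what it's worth, a minimal sketch of such a toggle, assuming a hypothetical environment variable name (NOCACHE_NOSYNC is made up here, not an existing nocache option); leaving it unset keeps the current behaviour:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdlib.h>     /* getenv() */
#include <unistd.h>

static void flush_and_drop(int fd)
{
    /* NOCACHE_NOSYNC is a hypothetical knob: when it is unset we keep
     * today's behaviour and sync before dropping the pages. */
    if (getenv("NOCACHE_NOSYNC") == NULL)
        fdatasync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}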

@Feh
Owner

Feh commented Apr 30, 2014

Addendum: Let me quote Tobi Oetiker on this:

As you can see we are calling fdatasync right before calling posix_fadvise, this makes sure that all data associated with the file handle has been committed to disk. This is not done because there is any danger of losing data. But it makes sure that the posix_fadvise has an effect. Since the posix_fadvise function is advisory, the OS will simply ignore it if it can not comply. At least with Linux, the effect of calling posix_fadvise(fd,0,0,POSIX_FADV_DONTNEED) is immediate. This means if you write a file and call posix_fadvise right after writing a chunk of data, it will probably have no effect at all since the data in question has not been committed to disk yet, and therefore can not be released from cache.
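
One way to observe the effect the quote describes is to count how many of the file's pages are resident in the page cache with mincore() before and after the posix_fadvise() call; without a prior fdatasync(), freshly written pages typically stay resident. A rough sketch (the fd must be readable; this is not part of nocache):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>   /* mmap(), mincore() */
#include <sys/stat.h>
#include <unistd.h>

/* Returns how many of the file's pages are currently in the page cache,
 * or -1 on error. The fd must be open for reading for mmap() to work. */
static long resident_pages(int fd)
{
    struct stat st;
    long pagesize = sysconf(_SC_PAGESIZE);
    if (fstat(fd, &st) < 0 || st.st_size == 0)
        return -1;
    size_t npages = (st.st_size + pagesize - 1) / pagesize;
    void *map = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;
    unsigned char *vec = malloc(npages);
    long resident = -1;
    if (vec && mincore(map, st.st_size, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* low bit set = page is resident */
    }
    free(vec);
    munmap(map, st.st_size);
    return resident;
}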

@noushi
Contributor

noushi commented Apr 30, 2014

Maybe we could add an option to postpone fdatasync() and posix_fadvise() until just before the program exits, by registering files on some list and processing them in bulk.
atexit() (and close() for a specific file) looks like the right place to do such a thing.
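
A minimal sketch of that idea, assuming we hang on to (duplicates of) the descriptors and flush them in bulk at exit; all names are illustrative, not from the nocache source:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdlib.h>     /* atexit() */
#include <unistd.h>

#define MAX_DEFERRED 1024

static int deferred_fds[MAX_DEFERRED];
static int n_deferred;

static void flush_deferred(void)
{
    for (int i = 0; i < n_deferred; i++) {
        fdatasync(deferred_fds[i]);
        posix_fadvise(deferred_fds[i], 0, 0, POSIX_FADV_DONTNEED);
        close(deferred_fds[i]);
    }
    n_deferred = 0;
}

/* Called instead of syncing immediately; fd is expected to be a private
 * duplicate, since the caller's own fd gets closed as requested. */
static void defer_fd(int fd)
{
    static int registered;
    if (!registered) {
        atexit(flush_deferred);
        registered = 1;
    }
    if (n_deferred == MAX_DEFERRED)   /* bound the backlog */
        flush_deferred();
    deferred_fds[n_deferred++] = fd;
}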

@Feh
Owner

Feh commented Apr 30, 2014

It’s a nice idea, but I wonder if it helps in this use case (lots of very small files). Do you propose to leave the fds open? In that case we will easily overflow somewhere if we just leave the fds open when the caller does a close() on them, e.g. we could end up with thousands of fds while FD_SETSIZE is only 1024 for select(). And if you close the fd on the caller’s request, how will you then “forget” the associated pages later on?

@noushi
Contributor

noushi commented Apr 30, 2014

If the user calls close(), we should indeed close it and not keep stale fds.

I haven't thought this through yet, but the general idea remains. Thinking out loud here:

  1. we keep the filenames, not the fds (possible issue with deleted files; maybe we ignore it)
  2. we process the files in a batch at the "end"
  3. what happens when the user calls unlink()? I suspect there is nothing we need to do.

@Feh
Owner

Feh commented Apr 30, 2014

Ad 1., just considering the possibility: we have no clue about what happens to a file after we close the fd. We might be right in most cases by remembering the file name and then acting on a re-opened version of it. But how do you find the name of the file that was opened? Consider multiple hard links to the same inode, etc. I think this adds a lot of potential for errors or inconsistent behavior while not really gaining much.

@noushi
Contributor

noushi commented Apr 30, 2014

I know there are issues with that in general, but in this specific case (rsync and possibly tar and others), it makes sense.
Also, this could be enabled only through some option, not by default.

how do you find the name of the file that was opened?

two ways:

  1. when we open() the file <-- better
  2. fstat() on the fd

Regarding hard links: it doesn't matter which link you use to reference the corresponding cached pages; they're physically the same. Am I missing something here?

@hhoffstaette : that said, isn't there a dontneed feature for rsync that's already available?
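
For reference, a rough sketch of option 1 from the list above: remembering the pathname in an LD_PRELOAD open() wrapper so the file can be re-opened and flushed later. This is heavily simplified and not the nocache source; a real wrapper would also have to cover open64(), openat(), creat() and friends:

#define _GNU_SOURCE
#include <dlfcn.h>      /* dlsym(), RTLD_NEXT */
#include <fcntl.h>
#include <limits.h>     /* PATH_MAX */
#include <stdarg.h>
#include <string.h>

#define MAX_FDS 1024

static char fd_paths[MAX_FDS][PATH_MAX];   /* pathname remembered per fd */

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    int fd = real_open(path, flags, mode);
    if (fd >= 0 && fd < MAX_FDS)
        strncpy(fd_paths[fd], path, PATH_MAX - 1);   /* remember the name */
    return fd;
}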

@Feh
Owner

Feh commented May 8, 2014

If the user calls close(), we should indeed close it and not keep stale fds.

FWIW: We could close it upon user request, but just before that use fcntl() with F_DUPFD to duplicate it to a high number and clean it up later (or periodically, since “later” might mean billions of files later, which might be a problem).
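
A minimal sketch of that F_DUPFD trick; park_fd() and HIGH_FD_BASE are made-up names for illustration:

#include <fcntl.h>      /* fcntl(), F_DUPFD */
#include <unistd.h>

/* Keep parked fds above what the application normally uses; the base
 * must stay below RLIMIT_NOFILE or fcntl() fails with EINVAL. */
#define HIGH_FD_BASE 500

static int park_fd(int fd)
{
    int parked = fcntl(fd, F_DUPFD, HIGH_FD_BASE);  /* may return -1 */
    close(fd);        /* the caller's fd is closed as requested */
    return parked;    /* fdatasync + fadvise + close this one later */
}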

@hhoffstaette: Any thoughts on the ideas discussed here…?

@hhoffstaette
Author

@Feh Sorry for the silence. Yes, lots of ideas, and I have even started to hack on it a bit already. :)
I needed to verify that a certain system call works as expected and think things through in terms of complexity, downsides etc. Essentially your last comment is on the right track, but we need to bound the accumulated work in terms of size and space.
Will provide a summary soon - later today or over the weekend.

@hhoffstaette
Author

@noushi Yes, I had Tobi Oetiker's original --drop-cache patch in rsync, but it had the same behaviour (fsync for every file) and was more or less unmaintained/unhackable. Also the rsync maintainers continue to refuse to merge it - probably a good thing in hindsight, considering the bad performance implications. nocache with its library/wrapper script approach is more versatile.

@Feh
Owner

Feh commented May 9, 2014

I needed to verify that a certain system call works as expected and think things through in terms of complexity, downsides etc. Essentially your last comment is on the right track, but we need to bound the accumulated work in terms of size and space.

Of course. Patches and pull requests welcome! :-)

@carlos-carvalho

I'd like to use nocache on the server for rsync backups. There rsync will only read files, so it'd be best to use POSIX_FADV_NOREUSE, without any syncs of course. Therefore I think an option to choose the argument to posix_fadvise would be worth it.

@Feh
Owner

Feh commented Jul 14, 2014

@hhoffstaette Any new developments on your part? How would you think about a simple option that’ll just disable calling fsync()?

@carlos-carvalho

Julius Plenz wrote on 14 July 2014:

@hhoffstaette Any new developments on your part? How would you think about a simple option that’ll just disable calling fsync()?

And the change to POSIX_FADV_NOREUSE instead of POSIX_FADV_DONTNEED. On the server rsync reads files only once.

@Feh
Owner

Feh commented Jul 15, 2014

And the change to POSIX_FADV_NOREUSE instead of POSIX_FADV_DONTNEED. On the server rsync reads files only once.

AFAICS, this would make the semantics correct, and that’s what nocache is also trying to do. But beware:

/* Hint we'll be using this file only once;
 * the Linux kernel will currently ignore this */
fadv_noreuse(fd, 0, 0);

As you can see in mm/fadvise.c, POSIX_FADV_NOREUSE is a no-op. So doing it the semantically correct way would mean the tool essentially does nothing for the (Linux) user.

@hhoffstaette
Author

hhoffstaette commented Dec 1, 2018

While looking at the (unrelated) current issue with glibc 2.28 I just realized I never commented on this. Since it's been >4 years I guess I might as well get over it... :)

Long story short: I never ended up with the planned "complicated" solution of batching small-file writes together and doing syncfs() for a whole target fs, with a delayed fadvise() for the whole batch. Instead I just applied a patch that simply skips fsync() for small files < 8MB, and that fixed 99% of my performance problem while at the same time cleaning the page cache "well enough" after copying larger files.

If anybody cares, the commit in question is here.
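
For the record, a sketch of the approach just described (the idea only, not the linked commit verbatim):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define SYNC_THRESHOLD (8 * 1024 * 1024)   /* 8 MB, as described above */

static void drop_cache(int fd)
{
    struct stat st;
    /* Only pay for the synchronous flush when the file is big enough
     * for dropping its pages to matter; small files skip the sync. */
    if (fstat(fd, &st) == 0 && st.st_size >= SYNC_THRESHOLD)
        fdatasync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}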

Maybe this can be an inspiration for an official flag to reduce fsync() overhead. Given all my other commitments I don't see myself contributing to this anymore any time soon.

@hhoffstaette closed this as not planned on May 13, 2023.