Data loss due to non-atomic writes #440

Closed
kentonv opened this issue Jul 1, 2016 · 11 comments
Comments

kentonv commented Jul 1, 2016

storage.py contains many places that update files by overwriting them in place, like:

with open(path, "w", encoding=self.storage_encoding) as fd:
    fd.write(item.serialize())

Unfortunately, if a crash or machine failure occurs between the first line and the second, the item is completely lost: open(path, "w") truncates the existing file, and the data isn't written back until fd.write() runs.

Note that on a typical Linux system, there may be a full 30 seconds between the call to write() and the data actually being flushed to disk. Therefore, after this code, there is a 30-second window in which a power failure will cause data loss.

This is a real problem for people running Radicale on https://oasis.sandstorm.io: In such large distributed systems, "power failures" -- or events that look a lot like power failures, such as losing the network connection to the storage volume -- are common. We've now had two independent reports of Radicale losing people's calendars on Oasis.

To fix this problem, Radicale should always write data to a temporary file first, and then os.rename() the temporary over the original. Technically it should also do os.fsync(f.fileno()) immediately before the rename (although on Linux / ext4 systems this is not strictly necessary).
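
A minimal sketch of that pattern (a hypothetical helper, not Radicale's actual code) could look like this:

import os
import tempfile

def write_atomic(path, data, encoding="utf-8"):
    # Hypothetical helper illustrating the temp-file-then-rename pattern.
    # The temporary file is created in the same directory as the target so
    # that os.rename() stays on one filesystem and is therefore atomic.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(prefix=".Radicale.tmp-", dir=directory)
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.rename(tmp_path, path)  # atomically replace the original file
    except BaseException:
        try:
            os.remove(tmp_path)
        except OSError:
            pass
        raise

Writing the temporary file into the same directory as the target matters: os.rename() is only atomic when the source and destination live on the same filesystem.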

@untitaker (Contributor)

python-atomicwrites handles this use case, although it might be overkill.
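
For reference, basic usage of the library looks roughly like the following; this mirrors its documented interface at the time, though exact keyword arguments may differ between versions:

from atomicwrites import atomic_write

# overwrite=True is required when the destination file already exists;
# the library writes to a temporary file and renames it into place.
with atomic_write(path, overwrite=True) as f:
    f.write(item.serialize())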

liZe (Member) commented Jul 13, 2016

@untitaker I've JUST RIGHT NOW found your project too. What timing.

liZe (Member) commented Jul 13, 2016

(Or you've just seen my last comment about this in #372, magic's not always magic.)

liZe closed this as completed in 4c91ee8 Jul 13, 2016

liZe (Member) commented Jul 13, 2016

I've used atomicwrites for now. It works, but I'd be happy to have real-life tests.

kentonv (Author) commented Jul 14, 2016

Thanks!

I'm not sure how to test this kind of problem, but I'll certainly let you know if there are any other reports of data loss after this update.

Unrud (Collaborator) commented Jul 17, 2016

This issue is not solved. AtomicWriter does not provide durability on non-Windows systems (see untitaker/python-atomicwrites#12). Collection.delete also does not ensure that changes are really written to disk.

It's still possible that deleted events reappear in clients and that created events or changes to existing events disappear. Only broken (partially written) items can't appear.
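
Making a delete (or a completed rename) durable requires fsyncing the containing directory as well; here is a minimal sketch of the idea, using hypothetical helpers rather than Radicale's actual Collection.delete:

import os

def fsync_directory(directory):
    # Flush the directory entry itself so that a rename or unlink that has
    # already happened survives a power failure (POSIX systems only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

def delete_durable(path):
    # Remove the file, then fsync its parent directory so the deletion is
    # actually committed to disk rather than sitting in the page cache.
    os.remove(path)
    fsync_directory(os.path.dirname(path) or ".")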

kentonv (Author) commented Jul 17, 2016

To be clear:

  • The fix as currently implemented ensures that people don't randomly lose their whole calendar, because the update is now atomic: there is no point in time at which a crash would leave both the old and new versions of the file lost.
  • However, until you fsync() the directory, it is possible that the replacement operation hasn't been committed to disk. A power outage or device disconnect at this point could mean the modification is lost, which is bad if the user has already been told the write succeeded. This is less disastrous than losing the whole calendar, but it's worth fixing as well.

@untitaker (Contributor)

It is problematic because a "200 OK" could've been returned shortly before the system crashed and reverted the changes.

Unfortunately I don't have access to a PC at the moment.


@untitaker (Contributor)

python-atomicwrites 1.1.0 solves this issue.

Unrud (Collaborator) commented Jul 26, 2016

> python-atomicwrites 1.1.0 solves this issue.

Collection.delete is still not durable.

@untitaker (Contributor)

Oh right.

Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 1, 2016
Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 1, 2016
Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 2, 2016
Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 5, 2016