Data loss due to non-atomic writes #440

Closed
kentonv opened this issue Jul 1, 2016 · 11 comments
Comments

kentonv commented Jul 1, 2016

storage.py contains many places that update files by overwriting them in place, like:

with open(path, "w", encoding=self.storage_encoding) as fd:
    fd.write(item.serialize())

Unfortunately, if a crash or machine failure occurs between the first line and the second, the item is completely lost: open(path, "w") truncates the existing file, and the data isn't written back until fd.write() runs.

Note that on a typical Linux system, there may be a full 30 seconds between the call to write() and the data actually being flushed to disk. Therefore, after this code, there is a 30-second window in which a power failure will cause data loss.

This is a real problem for people running Radicale on https://oasis.sandstorm.io: In such large distributed systems, "power failures" -- or events that look a lot like power failures, such as losing the network connection to the storage volume -- are common. We've now had two independent reports of Radicale losing people's calendars on Oasis.

To fix this problem, Radicale should always write data to a temporary file first, and then os.rename() the temporary over the original. Technically it should also do os.fsync(f.fileno()) immediately before the rename (although on Linux / ext4 systems this is not strictly necessary).
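
A minimal sketch of that pattern (a hypothetical helper, not Radicale's actual code) could look like this:

import os
import tempfile

def write_atomic(path, data, encoding="utf-8"):
    # Hypothetical helper illustrating the temp-file-then-rename pattern.
    # The temporary file is created in the same directory as the target so
    # that os.rename() stays on one filesystem and is therefore atomic.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(prefix=".Radicale.tmp-", dir=directory)
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.rename(tmp_path, path)  # atomically replace the original file
    except BaseException:
        try:
            os.remove(tmp_path)
        except OSError:
            pass
        raise

Writing the temporary file into the same directory as the target matters: os.rename() is only atomic when the source and destination live on the same filesystem.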

@untitaker (Contributor)

python-atomicwrites handles this use case, although it might be overkill.
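
For reference, basic usage of the library looks roughly like the following; this mirrors its documented interface at the time, though exact keyword arguments may differ between versions:

from atomicwrites import atomic_write

# overwrite=True is required when the destination file already exists;
# the library writes to a temporary file and renames it into place.
with atomic_write(path, overwrite=True) as f:
    f.write(item.serialize())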

liZe (Member) commented Jul 13, 2016

@untitaker I've JUST RIGHT NOW found your project too. What timing.

liZe (Member) commented Jul 13, 2016

(Or you've just seen my last comment about this in #372, magic's not always magic.)

liZe closed this as completed in 4c91ee8 Jul 13, 2016

liZe (Member) commented Jul 13, 2016

I've used atomicwrites for now. It works, but I'd be happy to have real-life tests.

kentonv (Author) commented Jul 14, 2016

Thanks!

I'm not sure how to test this kind of problem, but I'll certainly let you know if there are any other reports of data loss after this update.

Unrud (Collaborator) commented Jul 17, 2016

This issue is not solved. AtomicWriter does not provide durability on non-Windows systems (see untitaker/python-atomicwrites#12). Collection.delete also does not ensure that changes are really written to disk.

It's still possible that deleted events reappear in clients and that created events or changes to existing events disappear. Only broken (partially written) items can't appear.
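
Making a delete (or a completed rename) durable requires fsyncing the containing directory as well; here is a minimal sketch of the idea, using hypothetical helpers rather than Radicale's actual Collection.delete:

import os

def fsync_directory(directory):
    # Flush the directory entry itself so that a rename or unlink that has
    # already happened survives a power failure (POSIX systems only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

def delete_durable(path):
    # Remove the file, then fsync its parent directory so the deletion is
    # actually committed to disk rather than sitting in the page cache.
    os.remove(path)
    fsync_directory(os.path.dirname(path) or ".")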

kentonv (Author) commented Jul 17, 2016

To be clear:

  • The fix as currently implemented ensures that people don't randomly lose their whole calendar, because the update is now atomic: there is no point in time at which a crash would leave both the old and new versions of the file lost.
  • However, until you fsync() the directory, it is possible that the replacement operation hasn't been committed to disk. A power outage or device disconnect at this point could mean the modification is lost, which is bad if the user has already been told the write succeeded. This is less disastrous than losing the whole calendar, but it's worth fixing as well.

@untitaker (Contributor)

It is problematic because a "200 OK" could've been returned shortly before the system crashed and reverted the changes.

Unfortunately I don't have access to a PC at the moment.


@untitaker (Contributor)

python-atomicwrites 1.1.0 solves this issue.

Unrud (Collaborator) commented Jul 26, 2016

> python-atomicwrites 1.1.0 solves this issue.

Collection.delete is still not durable.

@untitaker (Contributor)

Oh right.

Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 1, 2016
Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 1, 2016
Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 2, 2016
Unrud pushed a commit to Unrud/Radicale that referenced this issue Aug 5, 2016