
Restic uses mtime to detect file changes, which can miss changes. #2179

Closed
d3zd3z opened this issue Feb 21, 2019 · 34 comments · Fixed by #2212
Labels: type: feature enhancement (improving existing features)

Comments

d3zd3z (Contributor) commented Feb 21, 2019

Output of restic version

restic 0.9.4 compiled with go1.11.4 on linux/amd64

How did you run restic exactly?

See below for the exact commands I used. The repo is local, and no other arguments are given to the backup command.

What backend/server/service did you use to store the repository?

Local.

Expected behavior

Follow the script below; restic is expected to back up the given file after it has been modified.

Actual behavior

"Files: 0 new, 0 changed, 4 unmodified"

Steps to reproduce the behavior

echo "Hello world" > a.txt
echo "hELLO WORLD" > b.txt
touch stamp
cat a.txt > hello.txt
touch -r stamp hello.txt
restic -r /tmp/test-repo -p a.txt init
restic -r /tmp/test-repo -p a.txt backup .
sleep 10
cat b.txt > hello.txt
touch -r stamp hello.txt
restic -r /tmp/test-repo -p a.txt backup .

Do you have any idea what may have caused this?

mtime should not be used to determine if a file needs to be backed up, ctime should be used. Worst case with ctime is that restic will needlessly read/hash the file to determine if it has changed. By using mtime it can skip backing up a file.

I have seen the Debian package manager, specifically, replace a file with a different file, and put the mtime back to the same value.

If I add -f to the backup command, the file will indeed be backed up.

Do you have an idea how to solve the issue?

Ideally, use ctime. Maybe use both, or provide an option, but ctime is really what should be used. It would probably have to fall back to mtime on a filesystem that doesn't have a ctime.
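The difference is easy to demonstrate outside restic. A minimal sketch, assuming GNU coreutils `stat` on Linux, where `%Y` prints mtime and `%Z` prints ctime as epoch seconds:

```shell
# Rewrite a file's contents, then forge the mtime back to a fixed value,
# the same way `touch -r stamp` does in the repro script above.
tmp=$(mktemp -d) && cd "$tmp"
echo "one" > f
touch -d "2020-01-01 00:00:00" f   # set mtime to an arbitrary past value
m1=$(stat -c %Y f)                 # mtime (epoch seconds)
c1=$(stat -c %Z f)                 # ctime (epoch seconds)
sleep 1
echo "two" > f                     # change the content...
touch -d "2020-01-01 00:00:00" f   # ...and forge the mtime back
m2=$(stat -c %Y f)
c2=$(stat -c %Z f)
[ "$m1" -eq "$m2" ] && echo "mtime identical: an mtime-only check skips the file"
[ "$c1" -ne "$c2" ] && echo "ctime differs: a ctime check re-reads the file"
```

The mtime can be set to anything via utimensat(), but every such call also bumps the ctime, which is why the second check still notices the change.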

Did restic help you or make you happy in any way?

It makes me happy that it otherwise seems to reliably back up my files.

aldem commented Mar 14, 2019

Using mtime introduces another issue: metadata-only changes (owner, group, mode, POSIX ACLs, and extended attributes) are not detected at all.

The -f option helps, but it takes ages to back up a huge filesystem, since it is effectively equivalent to the initial backup. For instance, in my case (ca. 35 GB of data, over 600K files) the first backup takes ca. 1.5 h, then less than 10 min for every subsequent snapshot, but with -f it always runs for more than 1 h.

fd0 (Member) commented Apr 23, 2019

Using mtime introduces another issue - changes in metadata only, like owner, group, modes, POSIX ACLs and any of extended attributes are not detected at all.

Hm, what do you mean? All the checks mentioned so far (mtime, and ctime as implemented in #2212) are only used to determine whether or not a file's contents need to be re-read. The metadata, including ACLs, is always freshly loaded and written to the repo. If it hasn't changed since the last backup, restic's deduplication ensures it isn't saved again. If anything has changed, the new metadata is saved in the repo.

Am I missing anything?

fd0 added the "type: feature enhancement" label and removed the "type: bug" label on Apr 23, 2019
aldem commented Apr 23, 2019

@fd0 What I mean is that if only the metadata has changed, mtime is not modified at all, so metadata-only changes are not carried over to the backup: the file is not recognized as changed at all. At least it behaved this way in 0.9.4 and still does in 0.9.5 (just tried). I have a recorded session that demonstrates this (using chown and backing up again).

Once #2212 is implemented it should work (I hope), but in the currently released versions it does not, unless --force is used (and forcing a re-read of everything is too expensive).
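The metadata-only case described above can be reproduced with plain shell (a sketch assuming GNU coreutils `stat` on Linux; `%Y` = mtime, `%Z` = ctime, both in epoch seconds):

```shell
# A permission change is a metadata-only update: the mtime stays put,
# but the kernel bumps the ctime, so only a ctime-aware scanner notices.
tmp=$(mktemp -d) && cd "$tmp"
echo "data" > f
m1=$(stat -c %Y f); c1=$(stat -c %Z f)
sleep 1
chmod 600 f        # metadata-only change, like the chown in the session above
m2=$(stat -c %Y f); c2=$(stat -c %Z f)
[ "$m1" -eq "$m2" ] && echo "mtime unchanged: an mtime-based scanner sees nothing"
[ "$c1" -ne "$c2" ] && echo "ctime bumped: a ctime-based scanner picks it up"
```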

fd0 (Member) commented Apr 23, 2019

@aldem oh wow, now I understand and I'm able to reproduce it. I think that is a (separate) bug (and a regression) of the new archiver code introduced with 0.9.0. I'll investigate...

fd0 (Member) commented Apr 23, 2019

It's a regression, it works correctly with 0.8.3. I'll open a new issue and fix it.

fd0 (Member) commented Apr 23, 2019

This is tracked as #2249

aldem commented Apr 23, 2019

@fd0 Thank you, this was a real show-stopper.

I found this ticket and then #2212, so I thought this was not implemented yet; that's why I didn't file a bug report.

fd0 closed this as completed in #2212 on Apr 25, 2019
DurvalMenezes commented May 24, 2019

Please excuse me, but shouldn't we be giving users more warning about this problem? Something along the lines of: "ATTENTION: all your backups made with restic <= 0.9.5 could be missing changed files!"

I only found out about this after being hard bitten (it cost me almost a full day of work, and a lot of peace of mind, hunting down apparent corruption between a saved ZFS snapshot of a directory and its restic-restored copy).

I think the seriousness of this goes beyond driving crazy someone like me who checks everything (i.e., SHA checksums for restored files): changes that should have been backed up are being missed. If anyone needs to recover one of these files from backup (operator error, disaster recovery, etc.), they will, without any warning, just get an older, outdated file. And in such a scenario, i.e., after the original file is lost, it is lost forever: there is obviously no way to recover it from a restic backup 😦

I made a separate post about it in the forum, but I think this should figure more prominently somewhere, perhaps in restic's website or even at the download page.

alphapapa commented

Can we talk about this a bit further please?

Isn't it standard behavior for Unix backup tools to use mtime to detect changed files?

Obviously, if you change File A's data, and then set its mtime to what it was before the data changed, the file appears unchanged. That seems like a classic case of, "Doctor, it hurts when I poke myself in the eye."

Now, also obviously, there are a wide variety of tools in the wild, and some of them might do The Wrong Thing like that. So, indeed, Restic ought to have an option to consider ctime or other timestamps to determine whether a file has changed. Having that option is unquestionably a good thing.

But it seems bogus to me to change the default. AFAIK, this is unexpected, non-standard behavior for a Unix backup tool, and as noted in #2495, changing the default is having significant, undesirable consequences.

As well, this change was made in a "Z" release (as in SemVer-style X.Y.Z versioning), which ought to be reserved for bug fixes, not changing default behavior. #2495 is just one example of Restic users with huge amounts of data (14 TB in that one). I think Restic users ought to be able to expect that upgrading from 0.9.x to 0.9.y will not change any behaviors except to fix bugs.

And this issue was almost certainly NOT a bug, because any program that purposely changes the mtime of files whose data have changed to an earlier timestamp is almost certainly doing The Wrong Thing and should expect to have that effect on backup tools.

So, I think the default behavior should be changed back to using mtime, which seems standard for Unix backup tools.

If I'm wrong about mtime being the standard for Unix backup tools, perhaps a survey would be in order.

Thanks.

smlx (Contributor) commented Dec 7, 2019

SemVer says:

  1. Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

aldem commented Dec 7, 2019

@alphapapa Actually, some "standard" *ix backup tools like tar (though there is no official standard, to be precise) do check ctime, since without checking ctime it would not be possible to detect metadata-only changes (ownership, modes, etc.).

d3zd3z (Contributor, Author) commented Dec 9, 2019

@alphapapa wrote:

Isn't it standard behavior for Unix backup tools to use mtime to detect changed files?

No, everything I'm aware of that backs up correctly uses ctime. mtime is pretty worthless for backups, as it can be set to an arbitrary value, and often is. Just unpacking a tar file results in files with their mtime set to whatever it was when the archive was created.

Backups done rsync style do use the mtime (but not against an increment, just to see if it is different), and rsync does miss files that have changed.

fd0 (Member) commented Dec 10, 2019

@alphapapa I've considered this a bug fix because with mtime restic can lose data (a file has changed but its mtime was reset) and does not pick up metadata-only changes. Both are very undesirable, IMHO. I did not expect that there are so many cases in which restic now re-reads data. Hmhmhm.

If I'm wrong about mtime being the standard for Unix backup tools, perhaps a survey would be in order.

At least borg detects changes based on ctime, size and inode by default: https://borgbackup.readthedocs.io/en/stable/usage/create.html

MarkMielke commented

There have been a few assertions made here that aren't necessarily right, but aren't necessarily wrong either, and I've debated with myself whether to challenge them...

A first assertion I would like to make is that neither ctime nor mtime is a guaranteed-reliable method of determining whether a change has occurred. This was especially true in the past, when these timestamps had second granularity (or perhaps worse, although I luckily never faced that). This means that many tools beyond backup tools are affected, including the long-used "make" command for building software. Things have improved now that filesystems have microsecond or better granularity, but I think it's an entirely false premise to believe that a timestamp alone can ever detect whether a change has definitely occurred. There will always be edge cases, and the best you can do is try to reduce them, although this often comes at a performance cost.

If you truly want to know whether content changed, you need to check the content. This applies to both data and metadata, and it is why commands such as "rsync" have had the ability to check content for as long as I can remember. The choice is always: either you can't trust the timestamps, in which case you should do a full comparison of all content, or you can trust them, in which case certain operations can be skipped to optimize the process.

ctime checking is an example of such a compromise. By adding ctime to the list of checks, you significantly reduce the already low chance of failure, but at the cost of a number of false positives. A ctime update does not imply that the content has changed, nor that any data recorded in the backup has changed. It only indicates that "something about the inode has changed".

The main issue I have with ctime is that what ctime is guarding is data which can itself be queried in full at low cost, and this is what rsync does. I mean that ctime will update if metadata changes, but checking the metadata for changes directly is always a better check than checking ctime. I would likely never write code that says "if ctime hasn't updated, then skip checking the owner, group, size, or inode number". The stat() call returns all this information, and it is readily available. Checking the metadata is about equal in cost, and more reliable, than checking whether the timestamp for the metadata has updated. So I don't really consider a ctime update as proof of anything except that the system marked the inode as having been updated, and if we are already checking the metadata we care about, this isn't really valuable information.
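The direct-comparison idea can be sketched in shell (a hypothetical illustration of the argument, not restic's implementation; assumes GNU coreutils `stat`): snapshot the lstat() fields once, and any later owner/group/mode/size/mtime/inode change shows up without consulting ctime at all:

```shell
# Sketch: compare stat() metadata fields directly instead of trusting ctime.
tmp=$(mktemp -d) && cd "$tmp"
echo "data" > f
# owner, group, mode, size, mtime, inode: the fields the argument says
# to compare directly, since stat() returns them all anyway
snap1=$(stat -c '%u %g %a %s %Y %i' f)
chmod u+x f                          # metadata-only change (mode)
snap2=$(stat -c '%u %g %a %s %Y %i' f)
[ "$snap1" != "$snap2" ] && echo "metadata differs: file must be re-saved"
```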

The one exception seems to be the case of a "restore"-like command, such as the above-mentioned "tar", restoring the mtime into the past, which has the cascade effect of causing the backup to see an older timestamp. In my experience, I have not found this to be as problematic as described. The timestamp is updated after filling the file with data, and any comparison of timestamps should use "not equals" rather than "greater than" or "less than", so I don't actually see the problem case mentioned earlier as a real concern. In fact, I'm tempted to argue the opposite: if the data was restored, perhaps it should be skipped. Although, this should be an explicit decision of the person doing the manipulations.

I also think it is worth mentioning that there are "restore" tools that can restore "ctime" as well as "mtime". For example, in several of my use cases we make heavy use of LVM thin volume snapshots, possibly with application coordination to "quiesce" the data (generally, flush it to disk and pause some types of updates), mount particular snapshots to a "backup" mount point, and then run Commvault (our current system) against the "backup" mount point. We want to switch this to Restic. This extra context is to explain that "ctime" and "inode number" are POSIX filesystem aspects that are not necessarily preserved or interpreted according to the expectations being placed on them. Other examples would include replication technologies. This is basically why restic has an "--ignore-inode" option: these backend systems often don't even pretend to honor the interpretation being applied, and that makes any checks against ctime invalid.

I think there are legitimate reasons to monitor ctime, and legitimate reasons to ignore it. This is not a case where one camp is definitely right and the other definitely wrong. It is a case where it is important to understand how your data is created and updated, in order to understand what the correct compromise between efficient and reliable backups is.

In my cases so far, "--ignore-inode" meets my requirements, and we will probably use it in most if not all real-life applications of Restic. We don't want the "ctime" checking behaviour. I don't believe that rsync misses as many cases as people have suggested, nor do I believe that ctime solves this problem 100%; I think that is an extremely conservative view of the world. That conservative view might be valid if you do not understand how the data is being created and updated, or if you think the performance cost is worth the additional checking. In my view, for our real-life production data, the performance cost is not worth it, and I will be advising that "--ignore-inode" be carefully considered and recommended in all cases, unless you also want to monitor for the specific exceptional cases that certain people have warned about.

aldem commented Dec 17, 2019

@MarkMielke

Checking the metadata is about equal in cost, and more reliable than checking whether the timestamp for the metadata has updated.

You forgot about POSIX ACLs and extended attributes in general: these are not returned by the stat() call. And even the inode information that stat() does return still requires additional comparisons, so it is only relatively "low cost" (it has to be parsed/deserialized and compared field by field). Multiply this by millions of files and you get the picture...

As to time resolution: yes, in some edge cases the (non-)change of ctime (on ancient filesystems lacking sub-second resolution) may miss actual updates, but that is highly unlikely. Even when a backup runs once per minute, the odds that a specific change hits exactly the same second as the previous backup, with nothing in between, are quite low; maybe on very heavily loaded systems (but if the system is that heavily loaded, then a content-comparison backup will most likely kill its performance completely).

I don't believe that rsync misses as many cases as people have suggested, nor do I believe the ctime solves this problem 100%

I can't speak for everyone, but in more than 15 years of using rsync with ctime+mtime checks I never had an issue with missed changed (meta)data (and that is petabytes of data and billions of files synced), while with restic (when ctime was ignored) I noticed the issue almost instantly (owner/mode changes were missed), and the workaround (always comparing content) significantly increased I/O and backup time even on a relatively low-volume backup.

You are lucky if your backup runs for less than one hour (when always comparing content), but if it needs 8-16 hours and has to run at least once a day, you will quickly find that this is a huge stress on the system (= everything is extremely slow). ctime, on the other hand (especially with the micro- or nanosecond resolution present in modern filesystems), almost completely eliminates the chance of missing a change in at least the metadata, and I am not aware of any method (at least on Linux) to manipulate ctime directly (excluding disk imaging, of course), which makes it quite a good indicator of metadata/data changes.

The main advantage is that in most practical cases ctime helps to avoid comparison of metadata or content, thus significantly (orders of magnitude) reducing backup time.

Yes, I agree that there should be options to tune restic's behavior (perhaps even extending them to use different methods based on paths/patterns), but in any case the default (tar/rsync-like) should be based on ctime+mtime; I believe that would work as expected "out of the box" for 99% of users.

d3zd3z (Contributor, Author) commented Dec 17, 2019

The one exception seems to be the case of a "restore"-like command, such as the above mentioned "tar", restoring the mtime into the past, which has the cascade effect of causing the backup to see an older timestamp. In my experience, I have not found this to be as problematic as described.

I've encountered it numerous times. It seems less common now with git, but I used to unpack tarballs of files a lot.

As for rsync failures: to clarify, I don't believe rsync ever uses the ctime. It is only comparing two trees, and since the ctime cannot be set, rsync can't set it to the same value as the source file's; it only compares mtime and possibly other attributes. The time I saw it fail was when a Debian package replaced a compressed file with a re-compression of the same file. The mtime was set to the time of the original file, and it re-compressed to the same size, but since the gzip header contains a timestamp, the contents were different. It didn't matter too much in this case, except that when I asked dpkg to verify the contents of installed packages, the file hash was wrong.

MarkMielke commented

I accept your point on access controls and extended attributes being more expensive to query than just lstat(). I would, however, note that this is a case of "better safe" vs. "performance", and you are choosing performance. I mean that when I use rsync and specify the -avHAXS flags that I am so familiar with typing, I am choosing to pay this cost, whereas you are choosing not to. You also agreed that ctime in the past, with second granularity, had a level of risk, but you are willing to call that risk quite low. So our definitions of comfort are much more grey than they are aligned or polar opposites. :-)

I can't speak for everyone, but for more than 15 years of using rsync with ctime+mtime checks, I never had an issue with missed changed (meta)data (and that are petabytes of data and billions of files synced), ...

Interesting that you say that, as my experience is the same... with one exception: rsync does not use ctime. I was pretty sure it didn't, but I've just checked the source to be absolutely sure, and it does not. There are zero mentions of st_ctim or st_ctime, the field pulled from lstat() to determine a file's ctime. There are several mentions of st_mtime, which gets pulled into the file->modtime data structure; the file structure has no space for ctime at all.

So, this leads me to be concerned about this point:

... while with restic (when ctime was ignored) I had noticed issue almost instantly (when owner/mode changes were missed), and workaround (comparing content always) significantly increased I/O and backup time even on relatively low-volume backup.

This seems to be the real problem you were facing! It makes me suspect you were dealing with race conditions, or some other problem. rsync also has race conditions, but that's why you typically run it more than once, to collect the updates made "since" it scanned that part of the directory tree; or, if you are feeling especially pedantic, as I sometimes am, you take a filesystem snapshot and run rsync or restic on the snapshot to ensure filesystem consistency.

aldem commented Dec 18, 2019

Rsync does not use ctime.

Well, it was my guess then that rsync uses ctime to detect metadata changes, as they were always picked up; the rest was surely done via mtime (in my data sets nothing deliberately manipulates mtime, so it is updated only on content changes).

But even if we resort to metadata-only comparisons (assuming that mtime is reliable for detecting content changes), the savings are quite huge. I really could not afford to wait 8 hours every time a backup has to be done, and you are right: I am willing to accept the risk, especially now that time resolution is on the nanosecond scale.

Honestly, I could not even imagine two changes being made within the same nanosecond, even on a RAM disk, accounting for all the I/O handling overhead (syscalls, context switches, etc.).

This makes me suspect you were dealing with race conditions, or some other problem.

No race conditions, it was really simple: a backup was made once, then a few files got mode/ACL changes (made a few minutes after the backup, so definitely no [cm]time resolution problem), and the next backup run some time later produced no activity. It turned out to be a bug in restic, though, as it neither compared the metadata nor checked ctime.

d3zd3z (Contributor, Author) commented Dec 18, 2019

Rsync does not use ctime.

Rsync doesn't use ctime because it can't. It is about synchronizing two directories; since the ctime cannot be set, it doesn't make any sense for rsync to compare it with anything.

This is very different from something like restic (or pretty much any other backup solution), where the ctime is being stored. In this case, it makes sense to compare the ctime with what is stored in the backup. I would go so far as to argue that comparing the ctime is the only way to correctly back everything up, and it is why, as far as I can tell, every backup system (other than rsync, which I wouldn't really call backup) uses the ctime to determine what to back up.

Rsync is a rather different tool than Restic.

MarkMielke commented

No race conditions, it was really simple - backup was made once, then few files get mode/acl changes (made few minutes after backup, so definitely no [cm]time resolution problem), and it didn't produce any activity on next backup run some time later. It turned out to be a bug in restic though, as it didn't compare metadata nor was checking ctime.

Yes, that is a huge bug. On its own, without any other factor. It should be checking either the metadata, or I suppose at least the metadata timestamp (= ctime). :-)

MarkMielke commented

... and is why, as far as I can tell, every backup system (other than rsync, which I wouldn't really call backup) uses the ctime to determine what to backup.

I use rsync for backup much more than any other tool, although it's often a component in a larger system. Most traditional backup systems are very poor and take hours and hours to back up, whereas rsync can complete in seconds or less even for large filesystems, if used with some awareness of how rsync works internally.

The whole reason I'm here, is because I'm of the belief that Restic is a bit different from traditional backup systems, and I want to see this belief proven true, and perhaps I can use rsync less, and Restic more.

For an example of a very common use case for me: we use rsync to copy a local filesystem to a remote NFS filesystem that is itself snapshotted and backed up. Other examples include using rsync to get data from a local filesystem on a production server to a local filesystem on a standby server, and then taking the backup on the standby server, so that the backup process itself (which might take several hours, especially if it is an application-based backup) does not create performance overhead on the production server. (Sometimes we get fancier than this... and we clone the iSCSI volume that hosts the data, mount it on the standby server, and back that up...)

My point in all this is that it's really easy for any of us, myself included, to look at our own experiences and draw easy, fast conclusions about how we do things and how other people must be doing it incorrectly. But if you don't know the requirements of the other people, it's hard to really say whether they are doing it incorrectly or not. There is more than one answer to this question.

alphapapa commented

@d3zd3z

Rsync does not use ctime.

Rsync doesn't use ctime because it can't. It is about synchronizing two directories. Since the ctime cannot be set, it doesn't make any sense for it to compare it with anything.

This is very different from something like restic (or pretty much any other backup solution), where the ctime is being stored. In this case, it makes sense to compare the ctime with what is stored in the backup. I would go so far as to argue that comparing the ctime is the only way to correctly back everything up, and it is why, as far as I can tell, every backup system (other than rsync, which I wouldn't really call backup) uses the ctime to determine what to back up.

Rsync is a rather different tool than Restic.

If I may play devil's advocate for a moment, to help me think more clearly about these tools:

How is Restic fundamentally different from Rsync? Rsync syncs two directory trees, whether remote or local. Restic effectively syncs two directory trees as well, one mounted locally and the other a virtual tree stored in the Restic backup repo. Of course, there are a million options, and Rsync is a very flexible and powerful tool; it can even make backups (real ones) by using hardlinks on the destination. But fundamentally, aren't they doing the same thing: syncing two directory trees?

If so, then by that logic, if Rsync doesn't use ctime, why should Restic? Is it just an optimization to store ctime and compare that instead of other metadata?

Thanks to you and @MarkMielke for your enlightening discussion here.

aldem commented Dec 18, 2019

If so, then by that logic, if Rsync doesn't use ctime, why should Restic?

rsync doesn't because it can't set ctime to a specific value on the target filesystem, while restic stores the ctime value in the archive, so it can be compared.

As I have mentioned before, comparing the metadata of millions of files (even without comparing content) can be quite expensive.

d3zd3z (Contributor, Author) commented Dec 18, 2019

If so, then by that logic, if Rsync doesn't use ctime, why should Restic? Is it just an optimization to store ctime and compare that instead of other metadata?

Rsync doesn't use ctime because it can't, not because it shouldn't. Another example is Unison, which does use ctime. It also stores a database for each side holding each file's metadata (mostly ctime) so that it can tell when a file changes.

The fundamental difference is that restic makes multiple snapshots of a filesystem, and stores all of them. rsync attempts to synchronize one directory with another, without storing any other data. It does the best that it can without storing anything, but because it has no way to know what the ctime was before, it really can't perfectly know if something has changed.

Restic isn't "syncing", it is making a snapshot. It works fine without referencing the old backup, and should even store the same result (because of deduplication). Since we are able to store the ctime, it can be compared against the source to make this optimization more robust than just guessing based on other parameters.

MarkMielke commented

As far as rsync failures. To clarify, I don't believe rsync ever uses the ctime. It is only comparing two trees, and since the ctime cannot be set, it can't set it to the same value as the source file. It only compares mtime and possibly other attributes. The time I had it fail was when a debian package replaced a compressed file with a re-compression of the same file. The mtime was set to the time of the original file, and it re-compressed to the same size. But since the gzip header has a timestamp, the contents were different. In this case it didn't matter too much, except when I asked dpkg to verify the contents of installed packages, and the file hash was wrong.

Ewww. :-) Bad compression program. :-)

MarkMielke commented Dec 18, 2019

Restic isn't "syncing", it is making a snapshot. It works fine without referencing the old backup, and even should store the same result (because of deduplication). Since we are able to store the ctime, that can be compared against the source to make this optimization more robust than just guessing based on other parameters.

These are semantics. :-)

I love it when disruptive technology like Git totally re-invents how people think (including how it likely shaped Restic), while fundamentally it's really about some simple concept like getting data from one place to another as efficiently as possible without breaking it.

Git commits are filesystem snapshots. You can argue whether creating a Git commit is synchronizing or snapshotting, but the effect is really the same. You are capturing state from one system and describing it in another system, in such a way that you could reproduce the original system +/- some artifacts. Amusingly to me... Git also does not store ctime. :-)

d3zd3z (Contributor, Author) commented Dec 18, 2019

Amusingly to me... Git also does not store ctime. :-)

Also not correct: the git index stores the ctime (and mtime) of each file, and if the ctime changes, it rehashes the file. Its behavior is pretty much identical to restic's (see the Git index format documentation).

MarkMielke commented Dec 18, 2019

My mistake. Sorry. :-) The pretty print of it excludes it. :-)

(Although, this begs the question of what it really gets used for... since if Git truly rehashed every file for every new workspace, it wouldn't actually work... research required.)

UPDATE: The index is only used for quickly detecting changes to the working tree. Git also has similar options, such as "trustctime", with an amusing twist:

       core.trustctime
           If false, the ctime differences between the index and the working tree are ignored; useful when the inode change time is regularly modified by something outside Git (file system crawlers and some backup systems). See git-update-index(1). True by default.

And in git-update-index:

       The command also looks at the core.trustctime configuration variable. It can be useful when the inode change time is regularly modified by something outside Git (file system crawlers and backup systems use ctime for marking files processed) (see git-config(1)).

Apparently some backup systems update ctime? :-) Eww....

@d3zd3z
Copy link
Contributor Author

d3zd3z commented Dec 18, 2019

(Although, this begs the question of what it really gets used for... since if it truly rehashed the file for every new workspace, Git wouldn't actually work... research required...)

If you copy a git workspace somewhere else (or restore from backup), git indeed will rehash every file.

@d3zd3z
Copy link
Contributor Author

d3zd3z commented Dec 18, 2019

Apparently some backup systems update ctime?

For example, GNU tar has an option --atime-preserve which, after accessing a file, sets the atime back, and setting the atime has the consequence of updating the ctime.

I'm not aware of anything that uses the ctime to mark files processed, or how that would even work. I'm guessing they modify mtime or atime, which has the consequence of updating the ctime.
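This side effect is easy to observe. A minimal demonstration, assuming Linux semantics and GNU coreutils' stat:

```shell
# Demonstration: setting only a file's atime still updates its ctime,
# because the inode itself changed.
f=$(mktemp)
before=$(stat -c %Z "$f")   # ctime, seconds since epoch
sleep 1.1                   # %Z has 1-second granularity
touch -a "$f"               # modify only the access time
after=$(stat -c %Z "$f")
[ "$after" -gt "$before" ] && echo "ctime advanced"
rm -f "$f"
```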

@alphapapa
Copy link

alphapapa commented Dec 18, 2019

If so, then by that logic, if Rsync doesn't use ctime, why should Restic? Is it just an optimization to store ctime and compare that instead of other metadata?

Rsync doesn't use ctime because it can't, not because it shouldn't. Another example would be Unison, which does use ctime. It also keeps a database for each side containing the files' metadata (mostly ctime) so that it can tell when a file changes.

Unison is an especially interesting example, since it uses the Rsync transfer protocol (though not the Rsync change-detection algorithm). From its manual:

Fast Update Detection
If your replicas are large and at least one of them is on a Windows system, you may find that Unison's default method for detecting changes (which involves scanning the full contents of every file on every sync—the only completely safe way to do it under Windows) is too slow. Unison provides a preference fastcheck that, when set to true, causes it to use file creation times as 'pseudo inode numbers' when scanning replicas for updates, instead of reading the full contents of every file.

When fastcheck is set to no, Unison will perform slow checking—re-scanning the contents of each file on each synchronization—on all replicas. When fastcheck is set to default (which, naturally, is the default), Unison will use fast checks on Unix replicas and slow checks on Windows replicas.

This strategy may cause Unison to miss propagating an update if the modification time and length of the file are both unchanged by the update. However, Unison will never overwrite such an update with a change from the other replica, since it always does a safe check for updates just before propagating a change. Thus, it is reasonable to use this switch most of the time and occasionally run Unison once with fastcheck set to no, if you are worried that Unison may have overlooked an update.

Fastcheck is (always) automatically disabled for files with extension .xls or .mpp, to prevent Unison from being confused by the habits of certain programs (Excel, in particular) of updating files without changing their modification times.

So many misbehaving programs. :)

The fundamental difference is that restic makes multiple snapshots of a filesystem, and stores all of them.

Rsync can do that with hardlink backups, each of which is a snapshot.

rsync attempts to synchronize one directory with another, without storing any other data. It does the best that it can without storing anything, but because it has no way to know what the ctime was before, it really can't perfectly know if something has changed.

I wonder if anyone has implemented some kind of "ctime cache" for Rsync to speed up comparisons of large trees.
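A crude version of such a cache can be sketched with GNU find and comm: record each file's path, ctime, and size on one run, then diff a later scan against that record (a sketch only, not an actual rsync feature):

```shell
# Sketch: a poor man's ctime cache for change detection (GNU find).
scan() { find "$1" -type f -printf '%p %C@ %s\n' | sort; }

dir=$(mktemp -d)
cache=$(mktemp)
echo one > "$dir/a"; echo two > "$dir/b"

scan "$dir" > "$cache"      # baseline snapshot of path/ctime/size
sleep 1.1
echo TWO > "$dir/b"         # modify one file (same size!)

# Lines in the new scan but not in the cache are new/changed files.
scan "$dir" | comm -13 "$cache" -
```

Note that the rewrite of `$dir/b` keeps the size constant; it is the ctime column that exposes the change.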

Restic isn't "syncing", it is making a snapshot. It works fine without referencing the old backup, and should even store the same result (because of deduplication).

Isn't making a snapshot fundamentally syncing data from its source to its destination in the snapshot? Logically it is syncing data from one place to another, regardless of the formats. Imagine mounting a Restic snapshot with FUSE and then running Rsync against it (hmm, that could be a useful way to verify the content of a snapshot after making one).

Since we are able to store the ctime, that can be compared against the source to make this optimization more robust than just guessing based on other parameters.

So it's just an optimization that depends on the filesystem behaving as expected, right?
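Right, and the original report shows exactly where that assumption breaks. After a rewrite followed by touch -r, the mtime and size match the old values while the kernel still bumps the ctime (a demonstration assuming Linux and GNU coreutils):

```shell
# The scenario from this issue: same size, mtime reset with touch -r,
# yet ctime still reveals the change.
dir=$(mktemp -d); cd "$dir"
echo "Hello world" > hello.txt
touch -r hello.txt stamp              # remember the original mtime
m1=$(stat -c %Y hello.txt); c1=$(stat -c %Z hello.txt)
sleep 1.1
echo "hELLO WORLD" > hello.txt        # same length, different content
touch -r stamp hello.txt              # put the mtime back
m2=$(stat -c %Y hello.txt); c2=$(stat -c %Z hello.txt)
[ "$m1" -eq "$m2" ] && echo "mtime unchanged: invisible to mtime+size"
[ "$c2" -gt "$c1" ] && echo "ctime advanced: visible to ctime"
```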

@geri777
Copy link

geri777 commented Jan 26, 2020

I came here because I would like it to ignore ctime. I've found the --ignore-inode switch, but it does not help regarding the ctime.

My suggestion to all this discussion would be to add a switch like:
--ignore-ctime

@rawtaz
Copy link
Contributor

rawtaz commented Jan 26, 2020

@geri777 As that is a feature request separate from this issue, please open a new issue (choose "Feature request" as the type when asked about it), and in there fill out the template. Please explain the use case for the request as well.

@scheuref
Copy link

scheuref commented Oct 25, 2022

Honestly, I could not even imagine that two changes will be made within same nanosecond even on RAM-disk, accounting for all that I/O handling overhead (syscalls, context switches etc).

I agree. And in order to miss a changed file, you will need two conditions:

  1. a file is changed twice in the same nanosecond
  2. restic reads the content of this file at that same nanosecond as well, after the 1st write and before the 2nd write.

The probability of 1 is almost zero.
The probability of 2 is smaller still.
The resulting probability is prob(1) x prob(2), which is hence practically nil.

There is a special case, if an application was designed to update the same file continuously, many times per nanosecond.
In this case prob(1) and prob(2) would both be 100%, but this is still a non-issue, because restic would correctly upload this file during each backup.

The issue would only arise if the application would stop the continuous rewriting of this file at the exact nanosecond when the restic scan reads that file.
Then restic would never upload the latest version.
But again this is only possible theoretically but not practically.

Another way to track changes efficiently would be to let restic access local filesystem snapshots.
Cf. https://bp.veeam.com/vbr/Support/S_Agents/val.html with a kernel module.
Or ZFS or LVM snapshots?
