Provide file sync filter config to ignore tilde lock files #990
Dropbox has an interesting doc on which files it does not sync: https://help.dropbox.com/installs-integrations/sync-uploads/files-not-syncing
One more interesting point here:
We agree on the principle of excluding some files from the synchronization. This could be a setting, applied by default to all files which start with
We want to move forward on this topic: disabling synchronization on certain files based on their name pattern. As a starting point, this will be with file names:
What are the first methods called when the user creates a new file in a "watched" mounted folder? What is the software pipeline that processes new files after their creation (after detection of a new file)? What is the best place to catch these newly created files and filter them? Where is the best place to store and set up the file name filter (a regex list)? In the config factory? Is there any impact of the encrypted file system that could cause issues with this filtering mechanism? For example, if the virtual file system automatically encrypts files and the sync system uploads encrypted chunks, that would create inconsistencies between the uploaded file system chunks and the actual corresponding "clear" file list.
Any ideas, help or hints on this topic?
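As a rough sketch of what such a filter could look like (the names DEFAULT_FILTER_PATTERNS, compile_filters and should_sync below are hypothetical, not Parsec's actual API), the filter could simply be a configurable list of regexes checked against each file name:

```python
import re

# Hypothetical default filter: Office "~$" lock files, Emacs/vi "~" backups,
# and LibreOffice ".~lock." files. The actual default list is up for debate.
DEFAULT_FILTER_PATTERNS = [r"^~\$.*", r".*~$", r"^\.~lock\..*"]

def compile_filters(patterns):
    """Compile the configured pattern list once, at config-load time."""
    return [re.compile(p) for p in patterns]

def should_sync(name, filters):
    """Return False if the file name matches any ignore pattern."""
    return not any(f.match(name) for f in filters)

filters = compile_filters(DEFAULT_FILTER_PATTERNS)
print(should_sync("my_file.doc", filters))    # True: regular file, synced
print(should_sync("~$my_file.doc", filters))  # False: Word lock file, ignored
```

Compiling the patterns once when the config is loaded keeps the per-file check cheap.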
Hi @bitlogik
Good thing you bumped this issue, I completely forgot about your earlier comments 😅
This part is a different yet related mechanism: most desktop environments on Linux automatically create a
We do too, although we are quite busy with other crucial changes at the moment.
A quick note here: parsec doesn't work like Dropbox in the sense that it doesn't "watch" a directory. Instead, it implements its own file system and mounts it using FUSE/WinFsp. Now the question is what we want to do with those files. At the moment, we'd like to try the following approach:
This way we don't have to change the data model, only the synchronization and display rules.
Yes, the GUI hits the
That's true but irrelevant if the proposed approach is implemented, as those files are going to be integrated into the local file system just like any other file.
Sounds good! An incremental approach could be:
That's a good point. While implementing the proposed solution, we'll have to remember the following use case:
Questions:
Yes, one more thing on our TODO list :) For the record, I'll also explain here another problem we'll face with the current approach: ignoring the synchronization of a file also means affecting the synchronization of its parent directory, since adding a new file also updates the parent manifest. This seems like a very tricky problem to me, considering the current implementation.
After digging into the code, I identified a good place for filtering the sync: storage objects have a get_need_sync_entries() method, called by SyncContext._load_changes() (parsec-cloud/parsec/core/sync_monitor.py, line 126 in b735aed).
I think if the user or manifest get_need_sync_entries methods can filter these files, they won't be uploaded to the server for backup, nor downloaded for local editing/viewing. There are 3 kinds of local storage (which have get_need_sync_entries):
And according to the code in the fs/storage directory, there are even chunk and local kinds of storage. What is a manifest? What is user storage? Also, I read somewhere in the code that there is at least one SQLite database (one for "workspace data", one for "workspace cache" and one for "user storage"). How are they related? What is a "realm"? These 3 storages all seem to have "realms"; what are they?
Yes, I get the issue. Maybe a better idea is to filter the files (ignoring some of them) right at this manifest update stage? As a theoretical approach, we should make parsec blind to these "temporary" files as early as we can. To dig in this direction: when a user (or the Office software, in our case) creates a file in the mounted directory (winfsp), what is the pipeline inside parsec that gets involved? What is the first part of the code that "sees" or detects the new file? What sequence of procedures is called when a new file is detected on the local fs? With these details, we can make a more rational decision to solve this issue. It can also prevent unexpected side effects, because we can think about how one thing affects the others. Ultimately, this better view of parsec would benefit our commitment to participate in its development, including other parts. We could even help write some docs or schematics. I also think a hybrid approach might be a good way:
The main point of this issue is to prevent the upload of the filtered "temp" files to an object storage. As for the potential issue with many devices, we can set that aside at the beginning. At a first stage, all filtering rules would be the same (assuming all users run the same client version). This kind of change can lead to unexpected behaviors; I hope a detailed answer to our questions can dramatically lower that possibility. Still, we'll tackle things one at a time.
@bitlogik Thanks a lot for your investigation on this issue! I'll be working on this for the next few days, so I'll get back to you with a more detailed answer by the end of the week.
Hi @bitlogik, sorry for the late reply!
Actually, this method is only used to determine the entries in the storage that need to be synchronized when connecting (or re-connecting) to the backend. So we need an entry point that is more general than that.
Most of the file system logic happens in the transaction files:
In our case, we want our filtered files to exist just like any other file in the system, except they should never be synchronized with the remote. That means most of the work has to be done in the
Actually, everything in the
A manifest is an atomic and immutable object storing the information for a particular entry of the system (file, directory, workspace, user, etc.)
I don't want to go too much into the details, but those are local storage objects (local SQLite databases) used to store all the local manifests in a persistent manner. Those storage objects can either be "cache" (storing data that is safely stored remotely) or "data" (storing data that only exists locally). Also, those storage objects can store either user information or workspace information.
"Realm" is a generic word describing either a user or a workspace (or more precisely, identifying a container for user/workspace information).
I think I can rephrase your questions as "At which level of the call chain do we want to implement the filtering?". And indeed there are many levels, from the mountpoint (FUSE/WinFsp) operations to the remote loader which uploads the manifests and blocks of data to the cloud. We had some internal (and quite technical) discussions about it, and we think it has to go in those transactions I mentioned earlier. More precisely, I think we might be able to implement this filtering by simply tweaking the
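To illustrate the idea of filtering at the transaction/manifest level, here is a minimal sketch (all names are hypothetical and the real manifest structure is certainly richer): temporary entries are pruned from the children mapping of the folder manifest that gets uploaded, while the local manifest keeps them all.

```python
import re

# Illustrative confinement rule: entries whose name starts with "~$".
TMP_PATTERN = re.compile(r"^~\$")

def filter_children(children, pattern=TMP_PATTERN):
    """Return a copy of a folder manifest's children mapping (name -> entry
    id) with temporary entries removed. The local manifest is left untouched;
    only the manifest sent to the remote uses the filtered mapping."""
    return {name: entry_id for name, entry_id in children.items()
            if not pattern.match(name)}

local_children = {"report.doc": "id-1", "~$report.doc": "id-2"}
remote_children = filter_children(local_children)
# Only the regular file is referenced by the uploaded manifest.
print(sorted(remote_children))  # ['report.doc']
```

The same predicate could be applied symmetrically on downsync, so that temporary entries mistakenly uploaded by another client are never integrated locally.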
One extra thought: I'm thinking of applying the same kind of filtering to the changes downloaded from the remote ("downsync"). This way, we make sure that local temporary files will never conflict with temporary files uploaded by mistake by other users (for instance, a user with a different configuration). This would also work well with existing files.
I think we should aim at a solution where each client can have their own filtering scheme.
Yes, both the data and metadata are encrypted before being sent to the remote, so the remote doesn't have access to this information.
Exactly: by filtering the temporary files out of the remote folder manifest, the client won't even download the corresponding file manifests.
Right, we could even add an option to show/hide those (local) temporary files.
I understand your point, but the solution I detailed earlier should allow us to implement both local and remote filtering at once. Hopefully it shouldn't be too painful.
That's very nice to hear! Hopefully this reply should help clarify a bit how the project is structured. I should be able to implement and push a proof of concept by the end of the week, I can tag you as reviewer if you're interested :)
Thanks for the details. Based on them, this approach seems to be a good one with regard to the foreseeable problems. It should deliver the expected UX improvement: disabling synchronization on certain files based on their name pattern.
Multiple programs (Excel, OpenOffice and Emacs, to name the more common ones) rely on temporary files to mark a file as opened and to prevent concurrent edits on it (typically, opening my_file.doc in Microsoft Word creates a ~$my_file.doc lock file). Currently there is no special treatment for those files whatsoever.
This has multiple drawbacks:
A simple solution would be to disable synchronization on certain files based on their name pattern.
Obviously the pattern should be configurable (given it depends heavily on the applications in use) and have a sensible default value (typically ignoring files matching the pattern ~$.*).
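One small note on the suggested default: used as a regular expression, the $ has to be escaped, otherwise it is interpreted as an end-of-string anchor. A quick sanity check (plain Python, purely illustrative):

```python
import re

# "~$" prefix of Office lock files; the "$" must be escaped in the regex.
lock_file = re.compile(r"~\$.*")

print(bool(lock_file.match("~$my_file.doc")))  # True: ignored
print(bool(lock_file.match("my_file.doc")))    # False: synchronized
```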