
Provide file sync filter config to ignore tilde lock files #990

Closed
touilleMan opened this issue Feb 21, 2020 · 12 comments · Fixed by #1359
Labels: enhancement (Improve functionality, potentially changing behavior)
@touilleMan
Member

touilleMan commented Feb 21, 2020

Multiple programs (Excel, OpenOffice, Emacs to name the more common ones :trollface: ) rely on temporary files to mark a file as opened (typically, opening my_file.doc in Microsoft Word creates a ~$my_file.doc lock file) and prevent concurrent edits on it.

Currently there is no special treatment for those files whatsoever.
This has multiple drawbacks:

  • useless synchronization: those files have no real value in the long run, so we are just slowing down the Parsec client and eating up more metadata (and data, but those files are typically really small so this is not a big deal) on the metadata server.
  • other devices may retrieve the lock file, which would trigger strange errors (the software would see the lock file and claim the file is already opened, when that's not the case on that device)
  • the lock file may contain configuration specific to the machine it was created on, which would further lead to strange errors
  • multiple devices opening the same file at the same time may end up with a conflict on the lock file

A simple solution to this would be to disable synchronization on certain files based on their name pattern.
Obviously the pattern should be configurable (given that needs can vary a lot) and have a sensible default value (typically ignoring files matching the pattern ~$.*).
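
A minimal sketch of what such a name-based filter could look like (the helper and constant names are illustrative, not part of parsec):

```python
import re

# Hypothetical default, matching the "~$..." Office lock files mentioned above;
# the actual setting name and default value are open for discussion.
DEFAULT_IGNORE_PATTERN = re.compile(r"^~\$.*")

def should_sync(filename: str, pattern: re.Pattern = DEFAULT_IGNORE_PATTERN) -> bool:
    """Return True if the file should be synchronized with the server."""
    return pattern.match(filename) is None

assert not should_sync("~$my_file.doc")  # Word lock file: skipped
assert should_sync("my_file.doc")        # regular file: synced
```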

@touilleMan touilleMan added this to Investigate in Dev Board Feb 21, 2020
@touilleMan touilleMan added the enhancement Improve functionality, potentially changing behavior label Feb 21, 2020
@touilleMan touilleMan changed the title Rethink file sync trigger Provide file sync filter config to ignore .~lock files Feb 25, 2020
@touilleMan touilleMan changed the title Provide file sync filter config to ignore .~lock files Provide file sync filter config to ignore tilde lock files Feb 25, 2020
@Max-7
Contributor

Max-7 commented Mar 2, 2020

Dropbox has an interesting doc on what files it does not sync: https://help.dropbox.com/installs-integrations/sync-uploads/files-not-syncing

@touilleMan
Member Author

One more interesting point here:
Microsoft Office creates a hidden tilde file every time it opens a file; however, this is not allowed if the file is located within a workspace the user has only Reader rights on.
This means we end up with a read-only workspace that cannot be read because of those tilde files.

@touilleMan
Member Author

Office error when trying to open a file in a workspace with Reader access rights:

[Screenshot: Office error dialog, 2020-03-26]

@bitlogik
Contributor

bitlogik commented Apr 2, 2020

We agree on the principle of excluding some files from the synchronization. This could be a setting, set up by default to cover all files starting with $ or . or ~ with a tmp extension (the "temporary files" of the Dropbox list). This would solve all the mentioned drawbacks in an elegant way. Later, this rule could be extended to many other categories from the Dropbox publication.

@bitlogik
Contributor

bitlogik commented May 7, 2020

We want to move forward on this topic: disabling synchronization on certain files based on their name pattern. As a starting point, this will cover file names:

  • starting with "~$"
  • starting with ".~"
  • starting with "~" and ending with ".tmp"
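
The three rules above can be collapsed into one regular expression; here is a quick sketch (the constant and function names are illustrative):

```python
import re

# One pattern covering the three proposed rules:
#   "~$foo", ".~foo", and "~foo.tmp"
TEMPORARY_FILE_PATTERN = re.compile(r"^~\$.*|^\.~.*|^~.*\.tmp$")

def is_temporary(name: str) -> bool:
    """Return True if the file name matches one of the temporary-file rules."""
    return TEMPORARY_FILE_PATTERN.match(name) is not None

assert is_temporary("~$report.docx")  # starts with "~$"
assert is_temporary(".~lock.ods#")    # starts with ".~"
assert is_temporary("~wrd0001.tmp")   # starts with "~" and ends with ".tmp"
assert not is_temporary("report.docx")
```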

What are the first methods called when the user creates a new file in a "watched" mounted folder? What is the software pipeline that processes new files after their creation (after the detection of a new file)? What is the best place to catch these newly created files and filter them?
Are there different code paths/pipelines when the user creates the file in the mounted file system versus importing a new file using the GUI? In any case, we can assume the new tmp files will only be created directly in the mounted folder, not through the GUI.

Where is the best place to store and set up the file name filter (a regex list)? In the config factory?

Is there any impact on the encrypted file system that could cause issues with this filtering mechanism? For example, if the virtual file system automatically encrypts the files and the sync system uploads encrypted chunks, that could bring inconsistencies between the uploaded file system chunks and the actual corresponding "clear" file list.
Also, we can imagine that a file system with many filtered temporary files will be dirty and very different from the uploaded backup image.
We don't know exactly how the underlying management of the encrypted file chunks is performed, as the architecture documentation is nonexistent (in TODO ™️ state); we just have to be sure that ignoring local "clear" files at some point doesn't mess up the backup sync system and/or the encrypted file system behavior.

@bitlogik
Contributor

bitlogik commented May 7, 2020

There is already a mechanism to filter this kind of thing in FuseOperations.create:

Still, we didn't even manage to trigger it. All file and folder creations, whether through the mounted folder or through the GUI (import), accepted the newly added path without any error. The "tick" for the sync is even present. Is it only meant to prevent an actual remote backup on the backend, so that mocking everything is OK?

@bitlogik
Contributor

Any ideas, help, or hints on this topic?

@vxgmichel
Contributor

vxgmichel commented May 14, 2020

Hi @bitlogik

Any ideas, help, or hints on this topic?

Good thing you bumped this issue, I completely forgot about your earlier comments 😅

There is already a mechanism to filter this kind of thing in FuseOperations.create:
Still, we didn't even manage to trigger it.

This part is a different yet related mechanism: most desktop environments on Linux automatically create a Trash-XXX directory at the root of mounted devices to use as a recycle bin, so banning the creation of those directories is a simple way to disable this feature. The reason why you could create those directories is (I assume) that you used the application, which is not affected by this limitation (as it is not really useful there).

We want to move forward on this topic

We do too, although we are quite busy with other crucial changes at the moment.

What are the first methods called when the user creates a new file in a "watched" mounted folder? What is the software pipeline that processes new files after their creation (after the detection of a new file)? What is the best place to catch these newly created files and filter them?

A quick note here: parsec doesn't work like Dropbox in the sense that it doesn't "watch" a directory. Instead, it implements its own file system and mounts it using fuse/winfsp.

Now the question is what do we want to do with those files? At the moment, we'd like to try the following approach:

  • the files are created, encrypted and integrated into the local parsec file system like any other file
  • their name is used to decide whether they should be synchronized with other devices or not
  • their name is used to decide whether they should be displayed in the GUI or not

This way we don't have to change the data model, only the synchronization and display rules.

Are there different code paths/pipelines when the user creates the file in the mounted file system versus importing a new file using the GUI?

Yes, the GUI hits the workspace_fs interface, which then hits the transactions layer, while the mountpoint system is implemented using winfsp/fuse operations that directly hit the transactions layer.

In any case, we can assume the new tmp files will only be created directly in the mounted folder, not through the GUI.

That's true but irrelevant if the proposed approach is implemented, as those files are going to be integrated into the local file system just like any other file.

Where is the best place to store and set up the file name filter (a regex list)? In the config factory?

Sounds good! An incremental approach could be:

  • hard-coded rules
  • then move those hard-coded rules into the configuration
  • then expose this configuration in the parsec application
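
A minimal sketch of the first two steps (a hypothetical config entry; the real field names in parsec's config factory may differ):

```python
import re
from dataclasses import dataclass

@dataclass
class CoreConfig:
    # Step 1/2: a hard-coded default that later moves into the config file.
    # The pattern combines the three rules proposed earlier in the thread.
    prevent_sync_pattern: str = r"^~\$.*|^\.~.*|^~.*\.tmp$"

    def compiled_pattern(self) -> re.Pattern:
        return re.compile(self.prevent_sync_pattern)

config = CoreConfig()
assert config.compiled_pattern().match("~$doc.docx") is not None
assert config.compiled_pattern().match("doc.docx") is None
```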

Is there any impact on the encrypted file system that could cause issues with this filtering mechanism? For example, if the virtual file system automatically encrypts the files and the sync system uploads encrypted chunks, that could bring inconsistencies between the uploaded file system chunks and the actual corresponding "clear" file list.

That's a good point. While implementing the proposed solution, we'll have to remember the following use case:

  • Device A filters .~ files
  • Device B does not filter .~ files
  • Device B creates and synchronizes a file called a.~
  • Device A recovers the a.~ file

Questions:

  • Should a.~ be displayed on device A?
  • Should a.~ be synchronized if device A changes its content?
  • What happens if a.~ already existed on device A?

We don't know exactly how the underlying management of the encrypted file chunks is performed, as the architecture documentation is nonexistent (in TODO ™️ state)

Yes, one more thing on our TODO list :)

For the record, I'll also explain here another problem we'll face with the current approach: ignoring the synchronization of a file also means affecting the synchronization of its parent directory, as adding a new file also updates the parent manifest. This seems like a very tricky problem to me, considering the current implementation.

@bitlogik
Contributor

bitlogik commented Jun 9, 2020

After digging in the code, I identified a good way to filter the sync: storage objects have a get_need_sync_entries() method, called by SyncContext._load_changes():

need_sync_local, need_sync_remote = await self._get_local_storage().get_need_sync_entries()

I think if the user or manifest get_need_sync_entries methods filter these files out, they won't be uploaded to the server for backup, nor downloaded for local editing/viewing.
Still, it is not clear to me what all these storage filesystems correspond to, or how they relate; what's the file processing pipeline?

There are 3 kinds of local storage (which have get_need_sync_entries):

  • manifest
  • user
  • workspace

And according to the code in the fs/storage directory, there are even chunk and local kinds of storage.
I guess workspace is for the GUI, so we don't care about it for now. Maybe for filtering the display, at another place in the code.

What is a manifest? What is the user storage? Also, I read somewhere in the code that there is at least one SQLite database (one for "workspace data", one for "workspace cache", and one for "user storage"). How are they related?

What's a "realm"? These 3 storages seem to have "realms". What is it?

ignoring the synchronization of a file also means affecting the synchronization of its parent directory, as adding a new file also updates the parent manifest. This seems like a very tricky problem to me, considering the current implementation.

Yes, I get the issue. Maybe a better idea is to filter the files (ignoring some of them) right at this manifest update stage? The theoretical approach is to make parsec blind to these "temporary" files as early as we can. To dig in this direction: when a user (or the Office software in our case) creates a file in the mounted directory (winfsp), what is the pipeline inside parsec that gets involved (process path)? What is the first part of the program code that "sees" or detects the new file? What sequence of procedures is called when a new file is detected on the local fs?
I guess this is something like: new file detected -> manifest update -> file encryption (chunks?) -> upload to the server (by the sync_monitor). As parsec is heavily async, this may not be so linear or simple to describe.

With these details, we can make more rational decisions to solve this issue. It can also prevent unexpected side effects, because we can think about how one thing affects the others. Ultimately, this better view of parsec would benefit our commitment to participate in the development of parsec, including other parts. We could even help write some docs or schematics.

I also think a hybrid approach might be a good way:

  • filtering local new files at the manifest update stage (so they are ignored by parsec)
  • filtering remote new files (in case of outdated other clients) before the server chunks are downloaded. Can you confirm the server can't see any filenames (in clear)? Can the client download the encrypted file name before downloading the file content?
  • filtering the workspace display (hiding these temp files)

The main point of this issue is to prevent the filtered "temp" files from being uploaded to the object storage.

As for the potential issue with many devices, we can set it aside at the beginning. At the first stage, all filtering rules would be the same (assuming all users run the same client version).

This kind of change can lead to unexpected behaviors; I hope a detailed answer to our questions can dramatically lower this possibility. Still, we'll tackle things one at a time.

@vxgmichel
Contributor

@bitlogik Thanks a lot for your investigation on this issue!

I'll be working on this for the next few days, so I'll get back to you with a more detailed answer by the end of the week.

@vxgmichel
Contributor

Hi @bitlogik, sorry for the late reply!

After digging in the code, I identified a good way to filter the sync: storage objects have a get_need_sync_entries() method, called by SyncContext._load_changes()

Actually this method is only used to determine the entries in the storage that need to be synchronized when connecting (or re-connecting) to the backend. So we need an entry point that is more general than that.

Still, it is not clear to me what all these storage filesystems correspond to, or how they relate; what's the file processing pipeline?

Most of the file system logic happens in the transaction files:

  • parsec/core/fs/workspacefs/file_transactions.py for file operations
  • parsec/core/fs/workspacefs/entry_transactions.py for folder operations
  • parsec/core/fs/workspacefs/sync_transactions.py for sync operations

In our case, we want our filtered files to exist just like any other file in the system, except that they should never be synchronized with the remote. That means most of the work has to be done in the sync_transactions.py file.

I guess workspace is for the GUI, so we don't care about it for now.

Actually, everything in the workspacefs directory is related to the file system and it's used by both the GUI and mountpoints. However, the workspacefs.py module is a higher-level interface on top of the transactions I described above, and is indeed mostly used by the GUI. So you're right that the filtering logic is not going to go into this particular file.

What is a manifest?

A manifest is an atomic and immutable object storing the information for a particular entry of the system (file, directory, workspace, user, etc.).

What is the user storage?
Also, I read somewhere in the code that there is at least one SQLite database (one for "workspace data", one for "workspace cache", and one for "user storage"). How are they related?

I don't want to go too much into the details, but those are local storage objects (local SQLite databases) used to store all the local manifests in a persistent manner. Those storage objects can either be "cache" (storing data that is safely stored remotely) or "data" (storing data that only exists locally). Also, those storage objects can store either user information or workspace information.

What's a "realm"? These 3 storages seem to have "realms". What is it?

Realm is a generic word describing either a user or a workspace (or, more precisely, identifying a container for user/workspace information).

Maybe a better idea is to filter the files (ignoring some of them) right at this manifest update stage? The theoretical approach is to make parsec blind to these "temporary" files as early as we can. To dig in this direction: when a user (or the Office software in our case) creates a file in the mounted directory (winfsp), what is the pipeline inside parsec that gets involved (process path)? What is the first part of the program code that "sees" or detects the new file? What sequence of procedures is called when a new file is detected on the local fs?
I guess this is something like: new file detected -> manifest update -> file encryption (chunks?) -> upload to the server (by the sync_monitor). As parsec is heavily async, this may not be so linear or simple to describe.

I think I can rephrase your questions as: "At which level of the call chain do we want to implement the filtering?". And indeed there are many levels, from the mountpoint (fuse/winfsp) operations to the remote loader, which uploads the manifests and blocks of data to the cloud. We had some internal (and quite technical) discussions about it, and we think it has to go into those transactions I was mentioning earlier.

More precisely, I think we might be able to implement this filtering by simply tweaking the sync_transactions module. The idea would be as follows:

  • A temporary file is created
  • A new local manifest for the file is created, tagged as need_sync
  • The local folder manifest is updated (since it now contains a new file), and tagged as need_sync
  • Signals are sent to notify the sync monitor that new entries need to be synced
  • The sync monitor triggers the sync transaction for both the temporary file and its parent
  • [Up to this point, nothing has changed: the system is not affected by the filtering whatsoever]
  • Sync transaction of the file:
    • The file is detected as temporary
    • No remote manifest is produced (nothing to upload)
    • The need_sync flag is reset
  • Sync transaction of the parent:
    • The parent is detected as containing a temporary file
    • A remote manifest ignoring the temporary files is produced
    • If the remote manifest contains new information it is uploaded
    • Then the local manifest is updated accordingly with its need_sync flag reset

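The parent sync step above could be sketched roughly as follows, under the assumption that a folder manifest exposes a name → entry id mapping for its children (the helper name, pattern, and data shape are hypothetical, not parsec's actual API):

```python
import re

# Hypothetical pattern covering the rules discussed earlier in the thread
TEMPORARY_PATTERN = re.compile(r"^~\$.*|^\.~.*|^~.*\.tmp$")

def filter_remote_children(local_children: dict, pattern=TEMPORARY_PATTERN) -> dict:
    """Build the children mapping for the remote folder manifest,
    dropping temporary entries so they are never uploaded."""
    return {
        name: entry_id
        for name, entry_id in local_children.items()
        if pattern.match(name) is None
    }

local_children = {"report.docx": "id-1", "~$report.docx": "id-2"}
assert filter_remote_children(local_children) == {"report.docx": "id-1"}
```

If the resulting remote manifest is identical to the last uploaded one, there is nothing new to upload and only the need_sync flag has to be reset.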
One extra thought: I'm thinking of applying the same kind of filtering to the changes downloaded from the remote ("downsync"). This way, we make sure that local temporary files will never conflict with other temporary files uploaded by mistake by other users (for instance, a user with a different configuration). This would also work well with existing files.

Filtering remote new files (in case of outdated other clients) before the server chunks are downloaded.

I think we should aim at a solution where each client can have their own filtering scheme.

Can you confirm the server can't see any filenames (in clear)?

Yes, both the data and the metadata are encrypted before being sent to the remote, so the server doesn't have access to this information.

Can the client download the encrypted file name before downloading the file content?

Exactly: by filtering the temporary files out of the remote folder manifest, the client won't even download the corresponding file manifests.
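
The downsync side could be sketched as a merge that ignores temporary names coming from the remote while keeping local temporary entries untouched (again, the names and data shape are hypothetical):

```python
import re

# Hypothetical pattern covering the rules discussed earlier in the thread
TEMPORARY_PATTERN = re.compile(r"^~\$.*|^\.~.*|^~.*\.tmp$")

def merge_remote_children(local_children: dict, remote_children: dict,
                          pattern=TEMPORARY_PATTERN) -> dict:
    """Take the remote view of the folder, dropping temporary names that
    another device uploaded by mistake, while preserving local temporary
    entries so they never conflict."""
    merged = {
        name: entry_id
        for name, entry_id in remote_children.items()
        if pattern.match(name) is None
    }
    # Local temporary files survive the merge untouched
    merged.update({
        name: entry_id
        for name, entry_id in local_children.items()
        if pattern.match(name) is not None
    })
    return merged

local = {"~$a.doc": "local-1", "b.doc": "local-2"}
remote = {".~lock.b#": "remote-1", "b.doc": "remote-2"}
assert merge_remote_children(local, remote) == {"b.doc": "remote-2", "~$a.doc": "local-1"}
```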

filtering the workspace display (hiding these temp files)

Right, we could even add an option to show/hide those (local) temporary files.

As for the potential issue with many devices, we can set it aside at the beginning. At the first stage, all filtering rules would be the same (assuming all users run the same client version).

I understand your point, but the solution I detailed earlier should allow us to implement both local and remote filtering at once. Hopefully it won't be too painful.

Ultimately, this better view of parsec would benefit our commitment to participate in the development of parsec, including other parts. We could even help write some docs or schematics.

That's very nice to hear! Hopefully this reply helps clarify a bit how the project is structured. I should be able to implement and push a proof of concept by the end of the week; I can tag you as a reviewer if you're interested :)

@bitlogik
Contributor

Thanks for the details. Based on our understanding, this approach seems to be a good one with regard to the foreseen possible problems. It should enhance the UX by "disabling synchronization on certain files based on their name pattern", as expected.
We'll review and test your implementation next week, when it's ready.
