Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend indexer rules with .gitignore when available #2459

Merged

Conversation

matheus-consoli
Copy link
Contributor

Extend the file indexer to account for ignored files within a git repository.


Example

Using this .gitignore:

partial/*.ignoreme
secrets

and this file structure:

myfiles
├── file
├── partial
│  ├── file.ignoreme
│  └── readme.md
└── secrets

the files listed in the .gitignore are skipped by the indexer and does not appear in the frontend:

image
image

Closes #(issue)

Copy link

vercel bot commented May 7, 2024

@matheus-consoli is attempting to deploy a commit to the Spacedrive Team on Vercel.

A member of the Team first needs to authorize it.

@matheus-consoli matheus-consoli force-pushed the eng-1357-enhance-indexer-rules branch from 3827c7d to 9727e5f Compare May 8, 2024 01:20
@0xBA5E64
Copy link
Contributor

0xBA5E64 commented May 8, 2024

I'm personally not entirely sure how I feel about this; Having the indexer skip over certain files feels like it defeats any chance of Spacedrive being used as a true file-explorer replacement it's supposedly aims to be. A developer building an application being unable to even see their dist/ folder or local dotenv files due to it being part of the project's .gitignore for instance seems frustrating at best.

.gitignore isn't necessarily used to hide files, after all but rather just to keep them out of git's versioning, which may be done for a variety of reasons, including specifically for local-only files like dotenv's which very much might want to be accessed.

I could understand perhaps disabling higher-level metadata extraction or something similar for files falling under .gitignore, but fully excluding them from the index should probably be, at the very least, an option the user should have to set themselves upon creating the Location, as opposed to the default.

@matheus-consoli
Copy link
Contributor Author

I agree with you, @0xBA5E64 - .gitignore (and other possible sources of rules) should be an opt-out option for users.

I'm still quite new to this codebase, but I think the current indexer cannot express this configuration for now.

I would love to be pointed wrong though, and it's probably something that should be added to the roadmap

@matheus-consoli matheus-consoli force-pushed the eng-1357-enhance-indexer-rules branch from 130a9ec to 29943f5 Compare May 8, 2024 21:29
@matheus-consoli matheus-consoli marked this pull request as ready for review May 9, 2024 18:31
@matheus-consoli matheus-consoli requested a review from a team as a code owner May 9, 2024 18:31
@fogodev
Copy link
Contributor

fogodev commented May 9, 2024

I'm personally not entirely sure how I feel about this; Having the indexer skip over certain files feels like it defeats any chance of Spacedrive being used as a true file-explorer replacement it's supposedly aims to be. A developer building an application being unable to even see their dist/ folder or local dotenv files due to it being part of the project's .gitignore for instance seems frustrating at best.

.gitignore isn't necessarily used to hide files, after all but rather just to keep them out of git's versioning, which may be done for a variety of reasons, including specifically for local-only files like dotenv's which very much might want to be accessed.

I could understand perhaps disabling higher-level metadata extraction or something similar for files falling under .gitignore, but fully excluding them from the index should probably be, at the very least, an option the user should have to set themselves upon creating the Location, as opposed to the default.

The indexer rules aren't only for showing files in the explorer, for indexed locations they define what should or shouldn't be stored in db and syncronized between multiple devices. I agree that maybe excluding these files from the explorer may not be great, but usually files ignored by .gitignore, really must not be stored in db synchronized between multiple devices. Imagine having to store and sync a node_modules directory for example, LMAO.

@fogodev
Copy link
Contributor

fogodev commented May 9, 2024

A possible option, I've discussed internally this week, is about merging results from database and from file system when showing indexed locations in the explorer. This way we will index everything for storing and sync purposes according to indexer rules, but show everything contained in the file system, maybe with a small icon in files not in db for example.

@HeavenVolkoff
Copy link
Member

merging results from database and from file system when showing indexed locations in the explorer. This way we will index everything for storing and sync purposes according to indexer rules, but show everything contained in the file system, maybe with a small icon in files not in db for example.

IMHO I think this is the best approach. The only question is if we handle this in the frontend or the backend. Technically we already have all the api call necessary for handling this entirely in the front, just need to call search.ephemeralPaths for the current path in the indexed location, and merge the results with those from the indexer. However I wonder if handling this in the backend wouldn't be better. It could check when the application is requesting a local indexed location and internally use the logic for the ephemeral location and merge the results and just send then to the front, considering we already handle Path and NonIndexedPath in a generic way it should work ootb.

core/crates/indexer-rules/src/seed.rs Outdated Show resolved Hide resolved
Copy link

vercel bot commented May 9, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
spacedrive-landing ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 17, 2024 6:18am
spacedrive-web ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 17, 2024 6:18am

@0xBA5E64
Copy link
Contributor

A file being listed in .gitignore only means for certain that the user doesn't want it as part of their git repo, it doesn't necessarily mean it's disposable, let alone that a user doesn't want it synced across systems.

The indexer rules aren't only for showing files in the explorer, for indexed locations they define what should or shouldn't be stored in db and syncronized between multiple devices.

Indeed, but I urge you to look at this in the reverse as well: The indexer rules don't just define what's to be synchronized between multiple devices, but also, what's being indexed and showed in the explorer.

Imagine having to store and sync a node_modules directory for example

For things like web-development, node_modules is indeed a great example of a folder you'd likely not want to sync in-between machines, but that might not be the case for all other files included in a .gitignore file, like logs, *.tgz builds, dotenv files, local sqlite databases etc.

What about, say, game-devs, that occasionally leave out all their game assets out of version management via .gitignore due to size constraints or simple lack of understanding of Git LFS? Or, sensitive info like dotenv-files users might not want synced to a public repo, but users might still want synced across your private devices, or at the very least, expect will be syncing when they are syncing entire locations through a tool like Spacedrive as opposed to simply git cloneing them on all their machines?

Excluding files from the indexer based on .gitignore might seem intuitive, but I can imagine a lot of people, including myself, find it quite off-putting if "syncing" a directory through Spacedrive didn't actually sync the folder, as-is, but instead made opinionated decisions behind-the-scenes without my explicit knowing or intervention this way.

To be clear I do certainly find the value in something like this, I just feel that something like avoiding node_modules from syncing should be part of the sync-logic, as opposed to the indexing logic. And either way, be an optional feature listed in either the Location settings;
Screenshot 2024-05-10 at 10 49 51
-or future "Sync settings".

@HeavenVolkoff
Copy link
Member

Indeed, but I urge you to look at this in the reverse as well: The indexer rules don't just define what's to be synchronized between multiple devices, but also, what's being indexed and showed in the explorer.

Yeah that is the intent, the sync system is being develop as an extension of the indexer. Under the hood the sync is mostly a way for sharing your indexed database between Spacedrive clients + the ability to request remote file data on-the-fly through a p2p connection. So, the indexer rules will influence what is being synced. About what is being displayed on the explorer we are discussing how to improve this, the idea is to integrate our indexed locations with the logic for the ephemeral file explorer, so that a local machine can seamlessly display all files for a location independently if they are actually being indexed or not. That way the indexed locations would work more like a normal file explorer with the added benefit of almost instantaneous search and sync for the indexed files. We would also like to implement something similar for the sync system, such as a client could remotely request an ephemeral view of a synced location to allow the user to see all the files on the remote Spacedrive, as long as the source is online.

To be clear I do certainly find the value in something like this, I just feel that something like avoiding node_modules from syncing should be part of the sync-logic, as opposed to the indexing logic. And either way, be an optional feature listed in either the Location settings or future "Sync settings".

We need to find a balance so that the indexer, by default, can be performant enough while retaining most of its usability. Having good defaults for files being ignored at first is an important part of this, and ignoring files listed in .gitignore seems like a good default so far. However we are very much in favor of every bit of the indexer settings to be configurable by the user, so this rule (to ignore files in the .gitignore ) will be something toggleable by the user in the near future as well as any other indexer rule we add. About allowing specific files to be synced even when they are being ignored by some general indexer rule, we will need to implement a prioritization system to fully support the .gitignore syntax, with that in place it would be possible for the user to add a rule to force adding files that are being ignored by a more general rule, we could then expand the ContextMenu in the Explorer to allow the user to select non-indexed items and enable indexing, and consequently sync, for them by transparently creating such index rule. That way if a rule, such as .gitignore, removes some files that the user needs to be indexed/synced he could easily add them back.

Copy link
Member

@HeavenVolkoff HeavenVolkoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. However some files ignored by the root .gitignore which are present in sub-directories are still being indexed. I think this can be fixed if we adapt the implementation of the following functions from gix-ignore:
gix_ignore::search::Search::pattern_matching_relative_path
gix_ignore::search::pattern_matching_relative_path

However I would be fine merging this as is and leaving the changes recommended above as an improvement for later if they are not something straight-forward to implement.

@matheus-consoli matheus-consoli requested a review from a team as a code owner May 16, 2024 04:48
Copy link
Member

@HeavenVolkoff HeavenVolkoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Great work @matheus-consoli

@HeavenVolkoff HeavenVolkoff added this pull request to the merge queue May 17, 2024
Merged via the queue into spacedriveapp:main with commit 26b6baf May 17, 2024
12 checks passed
@matheus-consoli matheus-consoli deleted the eng-1357-enhance-indexer-rules branch May 17, 2024 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants