Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

globwalk looks at all files even when it doesn't need to #29

Open
jyn514 opened this issue Jun 28, 2020 · 11 comments
Open

globwalk looks at all files even when it doesn't need to #29

jyn514 opened this issue Jun 28, 2020 · 11 comments

Comments

@jyn514
Copy link

jyn514 commented Jun 28, 2020

https://docs.rs/globwalk/0.8.0/src/globwalk/lib.rs.html#355

If the top-level directory is not a match, no subdirectories or files in the directory will be a match either. However globwalk will still look at them. If there are many files in the ignored directory, this can cause enormous slowdowns. In https://github.com/rust-lang/docs.rs/pull/861/files#diff-ee6431d852ce8913514eece9e3982d32R96-R99, we have several hundred thousand files in subdirectories, but only ~10 in the matched directory, so this causes a slowdown of several orders of magnitude.

@jyn514
Copy link
Author

jyn514 commented Jun 28, 2020

I'm willing to work on this.

@robinfriedli
Copy link

Same here. In my case I'm using Tera, which uses globwalk to handle globs and I noticed that my application is practically unusable when running in a Docker container where the working directory is the root directory of the container because globwalk spends hours walking directories like /sys, I've never even let it finish so who knows how much longer it would have spent scanning my entire file system when it really should just find two html files in src/resources/templates/*.html. I haven't noticed it when running it locally but even then it walks through the entire target/ directory.

Luckily, the fix appears to be rather simple. The iterator does not handle the case where ignore::overrides::Override::matched returns Match::None and only skips the directory if it returns Match::Ignore. So adding the following after line 410 should suffice:

Match::None if is_dir => {
    skip_dir = true;
    continue 'skipper;
}

Then globwalk seems to skip irrelevant directories as expected.

@Gilnaa
Copy link
Owner

Gilnaa commented May 11, 2021

@robinfriedli thank you for working on it!
I'll set a reminder to tix this in the upcoming days

@Gilnaa
Copy link
Owner

Gilnaa commented May 11, 2021

After a brief check, it sadly doesn't work exactly like that, and filters out correct files.
After applying the patch, the following tests fail after not finding any files:

failures:
    tests::test_blacklist
    tests::test_case_insensitive_matching
    tests::test_from_patterns

It looks like Match::None means that the path (the directory) did not match any of the paths, but it still can be a prefix of a correct match.

Can you supply an example where this worked for you?

@robinfriedli
Copy link

Yeah I jumped the gun a bit here. It is not quite as simple unfortunately as the directory being part of the pattern also results in a Match::None. Maybe it would be enough to check whether the directory is a prefix of the glob?

@Gilnaa
Copy link
Owner

Gilnaa commented May 11, 2021

We can construct a second glob for dirs that contains the prefixes of the original positive patterns.
Not sure this cover all cases; especially WRT to **, but it's a good start

@robinfriedli
Copy link

I thought of that but do we actually want to yield those directories? I guess we could keep those globs separate and only use them to decide on whether to skip. I'm not 100% sure either but creating a glob for each child path (using the path components) might work. For example, for the glob **/templates/*.html we would create the globs **/ and **/templates/, which would match any directory.

@robinfriedli
Copy link

That does seem to work to some extent, at least all tests pass (except for the readme test) and it does seem to skip irrelevant directories. I can open a PR if you like.

@Gilnaa
Copy link
Owner

Gilnaa commented May 12, 2021

Gladly. I'll try to attack it with tests over the weekend to see if we missed something.

Thanks again!

@robinfriedli
Copy link

No problem 😄Pull request opened

@JohnAZoidberg
Copy link

Same here. In my case I'm using Tera, which uses globwalk to handle globs and I noticed that my application is practically unusable when running in a Docker container where the working directory is the root directory of the container because globwalk spends hours walking directories like /sys

Wow, I just ran into this exact same issue. It probably would never finish due to recursive symlinks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants