
Speed up lost files search for large number of ignore paths #74

Merged: 2 commits merged from joanbm:master into CyberShadow:master on Apr 6, 2020

Conversation

@joanbm (Contributor) commented Apr 3, 2020

This PR implements a new option that can make the "Searching for lost files..." phase during aconfmgr save much faster when a large number of ignored path patterns (IgnorePath) is used.

The problem with using many ignored path patterns is that they end up being passed as a large list of -wholename conditions to find (which on Arch Linux is GNU find), causing the execution time of find (and hence of the entire phase) to scale linearly with the number of ignored path patterns.

This can end up accounting for the majority of the execution time. For example, I currently have 270 IgnorePath patterns in my configuration, and while a typical run with --skip-checksums but without this new option in this PR takes 35 seconds, the same run with the new option in this PR takes 15 seconds.

The way this speedup is implemented basically abuses implementation details of GNU find and has some important limitations, so honestly, feel free to reject this PR if you don't want to have this junk in aconfmgr. Basically, the -regex option of GNU find scales much better with an increasing number of patterns (since it uses a dedicated regex library), so that's what this PR uses. The problem, of course, is that -wholename takes a shell pattern while -regex takes a regular expression, so we need to convert one into the other. This is pretty complex and not really a job for aconfmgr, so I've just implemented asterisk wildcards, which is sufficient for all my use cases.
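
For illustration, here's roughly the shape of the two find invocations (a minimal sketch with hypothetical ignore patterns, not the PR's actual code):

# Current approach: one -wholename test per pattern, chained with -o;
# find evaluates each test in turn for every path it visits.
find / \( -wholename '/var/cache/*' -o -wholename '/home/*/.cache/*' \) -prune -o -print

# -regex approach: one combined regular expression, handled by a
# dedicated regex library that copes much better with many patterns.
# (-regextype posix-extended enables the '|' alternation syntax.)
find / \( -regextype posix-extended -regex '/var/cache/.*|/home/.*/\.cache/.*' \) -prune -o -print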

The new option is called --quick-lostfiles and changes the lost file search behaviour in aconfmgr save to use the regex trick I mentioned earlier. Since it has the limitation I mentioned of only accepting asterisk wildcards, it isn't (and can't be) enabled by default.

As an additional reference, I also sent a similar PR to lostfiles ( graysky2/lostfiles#42 ) but it got stuck in limbo.

@CyberShadow (Owner):

Thank you for another well thought out and thorough pull request.

and while a typical run with --skip-checksums but without this new option in this PR takes 35 seconds, the same run with the new option in this PR takes 15 seconds.

This is fantastic, and I'm excited about this optimization. However, I think a switch is not the best way to activate it.

Pedantically:

  • Obviously, the best place to fix it would be to improve GNU find to perform an optimization such as this implicitly. We're assuming this is not an achievable goal within the scope that we are limiting ourselves to.

  • I think the next best approach would be to detect when this optimization can be safely enabled, and then always enable it if so. If done right, no switch or other mode of configuration would be necessary. If detecting the version of find is necessary, it would be preferable to test its behavior instead of parsing any version strings.

  • If the above cannot be done for some reason, then I think the most fitting place to enable this optimization would be the user configuration, because it affects the behavior of IgnorePath lines, which is where those are located.

Practically:

  • I think fully translating globs (wildcard patterns) to regular expressions is an entirely achievable task. I've put together a pure bash implementation here which I believe to be correct.

  • According to pacman -F /usr/bin/find, GNU find is the only version of find which could be found on an Arch Linux system (and therefore a system that aconfmgr is running on). As such, we can assume that it will support the -regex switch, and that the performance is most likely going to be along the lines of what you have found.

  • As far as I can see, the only new limitation that we need to be aware of with this approach is hitting the maximum argument length, which is apparently hard-coded to 128kB on Linux. Though it's an unlikely limit to hit, it may occur when ignore rules are generated automatically (usually due to the user forgetting to escape wildcards in their IgnorePath lines, causing them to be expanded during configuration parsing). I think it would be easy to overcome by making sure that the created regular expression is split up in arguments no longer than 131072 characters each, which could then be chained using -o switches like how -wholename rules are chained now.
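
A minimal sketch of that chunking idea (variable names hypothetical; assumes the per-pattern regexes are already built and that this runs inside a function):

# Pack the per-pattern regexes into '|'-joined chunks, each kept below
# the per-argument limit, then chain the chunks with -o.
local max_len=131072
local -a chunks=()
local chunk='' regex
for regex in "${ignore_regexps[@]}"
do
	if [[ -n "$chunk" ]] && (( ${#chunk} + 1 + ${#regex} > max_len ))
	then
		chunks+=("$chunk")
		chunk=''
	fi
	chunk+="${chunk:+|}$regex"
done
[[ -z "$chunk" ]] || chunks+=("$chunk")

# Chain the chunks: the combined expression matches if any chunk matches.
# (Callers would typically wrap these in \( ... \) before -prune.)
local -a find_args=(-regextype posix-extended)
local i
for ((i = 0; i < ${#chunks[@]}; i++))
do
	(( i == 0 )) || find_args+=(-o)
	find_args+=(-regex "${chunks[i]}")
done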

What do you think?

@CyberShadow (Owner) commented Apr 3, 2020

  • I've put together a pure bash implementation here which I believe to be correct.

Oops, I forgot about escaping regular expression special characters. I think we can test correctness with some brute-force fuzzing to catch remaining bugs.
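
For reference, a rough sketch of what such a conversion can look like (this is not the linked implementation: it handles only the '*' and '?' wildcards plus literal characters, and ignores backslash escapes and bracket expressions in the glob, which is exactly the kind of gap discussed below):

function GlobToRegex() {
	local glob=$1 out='' c
	local -i i
	for ((i = 0; i < ${#glob}; i++))
	do
		c=${glob:i:1}
		case "$c" in
			'*') out+='.*' ;;  # glob star -> regex .*
			'?') out+='.' ;;   # glob ? -> regex .
			'^') out+='\^' ;;  # '[^]' would be an invalid bracket expression
			*) out+="[$c]" ;;  # a bracket expression makes anything else literal
		esac
	done
	printf '%s\n' "$out"
}

GlobToRegex 'foo*.pem'  # prints [f][o][o].*[.][p][e][m]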

@joanbm (Contributor, Author) commented Apr 4, 2020

  • Obviously, the best place to fix it would be to improve GNU find to perform an optimization such as this implicitly. We're assuming this is not an achievable goal within the scope that we are limiting ourselves to.

Pretty much. I haven't looked at how it's implemented in GNU find, but judging from the behaviour, it looks like the filters are applied by just checking them in succession, with no optimization at all, while for regexes they use a library which I guess does optimize common prefixes and the like. So my first guess is that improving GNU find to make this case faster would be a pretty huge task involving implementing some kind of optimizer for non-regex queries.

  • I think the next best approach would be to detect when this optimization can be safely enabled, and then always enable it if so. If done right, no switch or other mode of configuration would be necessary.

If the idea of actually supporting all patterns doesn't pan out, this definitely sounds like a better way to go about it than a new switch (not sure why it didn't even cross my mind).

  • I think fully translating globs (wildcard patterns) to regular expressions is an entirely achievable task. I've put together a pure bash implementation here which I believe to be correct.

I had in mind that shell patterns were pretty complex (if you look at man glob's "GLOBBING PATTERNS" section, there are already a few special rules). Additionally, I'm not sure what 'variant' GNU find implements, since, for example, find . -wholename '*' matches dotfiles (hidden files) while echo * does not.

But with some luck, the rules find supports (which, at the end of the day, are the ones we support today) will be simple enough. I will take a closer look to see what the actual rules are, and whether something is missing in your implementation. And as you said, fuzzing, or at least testing a lot of cases, is definitely the way to verify this.

(And thanks for putting in the time for the bash implementation, that will definitely be useful as a base to work on!)

  • According to pacman -F /usr/bin/find, GNU find is the only version of find which could be found on an Arch Linux system (and therefore a system that aconfmgr is running on). As such, we can assume that it will support the -regex switch, and that the performance is most likely going to be along the lines of what you have found.

GNU find is in findutils, which is a dependency of base, so I think we can safely assume it's there. I think we are already using non-standard extensions like -wholename, so it isn't even a new problem.

  • As far as I can see, the only new limitation that we need to be aware of with this approach is hitting the maximum argument length, which is apparently hard-coded to 128kB on Linux. Though it's an unlikely limit to hit, it may occur when ignore rules are generated automatically (usually due to the user forgetting to escape wildcards in their IgnorePath lines, causing them to be expanded during configuration parsing). I think it would be easy to overcome by making sure that the created regular expression is split up in arguments no longer than 131072 characters each, which could then be chained using -o switches like how -wholename rules are chained now.

I don't think this is really something someone would typically hit (and as you said, it's more likely to be hit due to misuse), but I'll keep it in mind.

@joanbm (Contributor, Author) commented Apr 4, 2020

UPDATE: It looks like GNU find just translates a -wholename pattern into fnmatch(pattern, path, 0).
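
A visible consequence of that (hypothetical paths): since no fnmatch flags are passed, neither FNM_PATHNAME nor FNM_PERIOD is in effect, so '*' in -wholename matches across '/' and leading dots, which also explains the dotfiles observation above:

find /etc -wholename '/etc/*.conf'  # also matches /etc/foo/bar.conf and /etc/.hidden.conf
echo /etc/*.conf                    # shell glob: '*' crosses neither '/' nor a leading dot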

@joanbm (Contributor, Author) commented Apr 4, 2020

I took a look at it and found a good test suite in glibc-2.31. However, this seems to reveal that there's some more complexity that's not handled by the bash implementation, like:

  • Escapes with backslash like the pattern \<\> should match <>.
  • "Double brackets" like [[:alnum:]-] should match -.
  • Range edge cases like [a-c-0-9] are valid in glob but not regex.
  • i18n issues like [a-z] should match ä in de_DE.ISO-8859-1 locale.
  • Library differences like [[.-.]] matches - in glibc but not musl.
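
To make the first two concrete, bash's own glob matching can demonstrate them (simple checks, not part of the thread):

[[ '<>' == \<\> ]] && echo match          # backslash escapes match the literal characters
[[ '-' == [[:alnum:]-] ]] && echo match   # character class plus a literal trailing hyphen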

Here's the test script: https://gist.github.com/joanbm/a9277f30a2a1c8a50454ede43065544c . I can't rule out that some of the problems are due to testing with grep -E instead of find, or other issues.

My general thought is that since getting the implementation right is tricky, and even if doable the code will probably be rather ugly, I'd rather go with the simpler "whitelist" approach, where we check whether ignore expressions are 'simple' (non-special characters, asterisk wildcards, maybe also question mark wildcards) and transform them to regex if so. An important thing to note here is that we can implement it so that it's not an "all-or-nothing" thing, so that if the user has 200 simple ignore patterns and 10 crazy ignore patterns, we can 'upgrade' the 200 simple ones to -regex but keep the 10 crazy ones as -wholename.
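
A minimal sketch of that partitioning (variable names hypothetical; the whitelist here uses character ranges for brevity, although, as it turns out later in this thread, ranges have locale pitfalls):

local -a simple_ignore_paths=() complex_ignore_paths=()
local ignore_path
for ignore_path in "${ignore_paths[@]}"
do
	# Anything containing a character outside the whitelist stays on the
	# 'slow' -wholename path; the rest get upgraded to -regex.
	if [[ "$ignore_path" =~ [^a-zA-Z0-9_/.\ *-] ]]
	then
		complex_ignore_paths+=("$ignore_path")
	else
		simple_ignore_paths+=("$ignore_path")
	fi
done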

What do you think?

@CyberShadow (Owner):

  • Escapes with backslash like the pattern \<\> should match <>.

  • "Double brackets" like [[:alnum:]-] should match -.

  • Range edge cases like [a-c-0-9] are valid in glob but not regex.

Good finds, I missed these.

  • Library differences like [[.-.]] matches - in glibc but not musl.

I think this is not an issue for our use case. I don't think Arch Linux can be made to work with another libc.

  • i18n issues like [a-z] should match ä in de_DE.ISO-8859-1 locale.

This is probably the only one that's not solvable by hammering at the bash implementation for a bit.

An important thing to note here is that we can implement it so that it's not an "all-or-nothing" thing, so that if the user has 200 simple ignore patterns and 10 crazy ignore patterns, we can 'upgrade' the 200 simple ones to -regex but keep the 10 crazy ones as -wholename.

Sounds great!

@joanbm (Contributor, Author) commented Apr 5, 2020

I've updated the PR with an implementation of your idea of automatically detecting and upgrading simple patterns to regex, so the switch is no longer necessary. I'm quite satisfied with the final result, which, while still conceptually kind of a hack due to the abuse of GNU find's implementation behaviour, is self-contained, and the code is relatively short and understandable.

Some notes:

  • I've chosen to whitelist only a few obviously literal characters (alphanumeric, underscore, slash, dot, space, hyphen, and of course the asterisk wildcard), which I expect to be the typical ones used in IgnorePath, and which suffice for all my patterns. I could probably have added more, e.g. ;=#$%&, but the benefit is minimal (those patterns just drop to the 'slow' path) and I don't want to play with fire and end up with a regression.

  • I played around with what you said about splitting the regular expression into chunks of size MAX_ARG_STRLEN (the maximum length of a single command-line argument, 128K on Linux), to avoid hitting this limit prematurely when many patterns are used (thus allowing up to ARG_MAX, typically 2MB on Linux). However, this won't work due to another restriction: since we invoke find as root through sudo, and sudo sets the SUDO_COMMAND environment variable to the full command line, and environment variables are also limited to MAX_ARG_STRLEN, our effective total command-line length limit for find is currently (and already was) MAX_ARG_STRLEN, not ARG_MAX. So chunking the regex by MAX_ARG_STRLEN is useless unless this is resolved (which I found no obvious solution for).
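
The sudo restriction is easy to observe (a one-off demonstration, not aconfmgr code): SUDO_COMMAND carries the entire command line in a single environment variable, so it is itself capped at MAX_ARG_STRLEN:

sudo sh -c 'printf "SUDO_COMMAND is %s bytes long\n" "${#SUDO_COMMAND}"'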

@joanbm joanbm changed the title Add option to speed up lost files search (with limitations) Speed up lost files search for large number of ignore paths Apr 5, 2020
@joanbm force-pushed the master branch 2 times, most recently from 46ea603 to 359a5de on April 5, 2020 12:27
@joanbm (Contributor, Author) commented Apr 5, 2020

BTW, not really directly related to this PR, but I've been wondering whether we could avoid all the "path rewriting" mocks in the unit tests by using something like fakeroot or bubblewrap to make the tests believe they can write to the root file system (while still running as a normal user and using the binaries of the host instead of an isolated container). I'm not sure if that's really viable, or if you could say it makes the unit tests more like integration tests, though.

@CyberShadow (Owner) left a review:

Great stuff!

  • So splitting the regex by MAX_ARG_STRLEN is useless unless this is resolved.

Argh, yes, of course.

aconfmgr does support running as the root user, which is useful for setting up a new system, but as it's unlikely that execution mode is going to be the only one under which a configuration is expected to work, it's not worth breaking a sweat over. There is also a sudoserver branch which may lift limitations imposed by sudo, but I don't know if I'm going to finish / merge it.

src/common.bash Outdated
# However, this is painfully slow if the number of ignore patterns is large
# Instead, this function (ab)uses the fact that if the number of patterns
# given to GNU find is big, then as an implementation detail, using regular
# expressions (-regex option and similar) scale smuch better than using
CyberShadow (Owner):

Typo:

Suggested change
# expressions (-regex option and similar) scale smuch better than using
# expressions (-regex option and similar) scales much better than using

joanbm (Contributor, Author):

Will update.

src/common.bash Outdated
@@ -153,6 +153,51 @@ function AconfCompileOutput() {
skip_inspection=n
skip_checksums=n

# Given a list of paths to ignore (which can contain shell patterns),
# creates the list of arguments to give to 'find' to ignore them
CyberShadow (Owner):

Suggested change
# creates the list of arguments to give to 'find' to ignore them
# creates the list of arguments to give to 'find' to ignore them.
# The result is stored in the global variable "ignore_args".

CyberShadow (Owner):

There is a way to "cleanly" return an array from a bash function, without creating global variables or using pipes:

#!/bin/bash
set -eEuo pipefail

function Callee() {
	local varname=$1
	local -n var=$varname

	# shellcheck disable=SC2034
	var=(hello world)
}

function Caller() {
	local -a hello
	Callee hello
	printf '%s ' "${hello[@]}"
	printf '\n'
}

Caller

echo "$hello" # Unbound, not a global!

I didn't know about it when I wrote aconfmgr, which is why some places are written using less clean approaches.

joanbm (Contributor, Author):

Nice, I didn't know that (I'm not too well-versed in Bash). Will update.

src/common.bash Outdated
local ignore_path
for ignore_path in "${ignore_paths[@]}"
do
if [[ "$ignore_path" =~ [^a-zA-Z0-9_/\ \-.*] ]]; then
CyberShadow (Owner):

Nit, the code style in aconfmgr is to use line breaks instead of semicolons (shell equivalent of Allman braces).

joanbm (Contributor, Author):

Will update.

src/common.bash Outdated
# Converting the simple ignore paths to regular expressions just needs to
# handle the asterisk wildcard, and escape a few characters
local ignore_regexps
mapfile -t ignore_regexps < \
CyberShadow (Owner):

Pipe into mapfile instead of using process substitution. We have lastpipe on, so that will work and save a subshell.

joanbm (Contributor, Author):

Will update.

src/common.bash Outdated
local ignore_regexps
mapfile -t ignore_regexps < \
<( IFS=$'\n' ; echo -n "${simple_ignore_paths[*]}" | \
sed 's|[^*a-zA-Z0-9_/ ]|[&]|g; s|\*|.*|g' )
CyberShadow (Owner):

sed -z + printf '%s\0' to allow (really) weird characters like newlines?
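
Combined with the lastpipe suggestion above, that might look like the following (a sketch, not the final code; mapfile -d '' needs bash >= 4.4):

local -a ignore_regexps
printf '%s\0' "${simple_ignore_paths[@]}" | \
	sed -z 's|[^*a-zA-Z0-9_/ ]|[&]|g; s|\*|.*|g' | \
	mapfile -d '' -t ignore_regexps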

joanbm (Contributor, Author):

Doesn't matter because simple_ignore_paths doesn't contain any weird characters by definition (see the regex used to build it a few lines before).

src/common.bash (resolved)
@@ -102,6 +102,9 @@ test_globals_whitelist=(
system_dir
tmp_dir

# Internal state (AconfCreateFindIgnoreArgs)
ignore_args

CyberShadow (Owner):

In case local -n doesn't work out, I think it would be better to simply unset it once it's no longer needed, as this variable has finite lifetime anyway.

joanbm (Contributor, Author):

local -n works so I will just remove this.

Comment on lines 44 to 45
local regex_escaped_dir=$(echo -n "$test_data_dir"/files | \
sed 's/./[&]/g')
CyberShadow (Owner):

Suggested change
local regex_escaped_dir=$(echo -n "$test_data_dir"/files | \
sed 's/./[&]/g')
local regex_escaped_dir=$(sed 's/./[&]/g' <<< "$test_data_dir"/files)

joanbm (Contributor, Author):

Will update.


local regex_escaped_dir=$(echo -n "$test_data_dir"/files | \
sed 's/./[&]/g')
args+=("$regex_escaped_dir($arg)")
CyberShadow (Owner):

Suggested change
args+=("$regex_escaped_dir($arg)")
args+=("$regex_escaped_dir($arg)") # Group |-delimited patterns

joanbm (Contributor, Author):

Will update.

@CyberShadow (Owner):

BTW, not really directly related to this PR, but I've been thinking if we could avoid all the "path rewriting" mocks on the unit tests by using something like fakeroot or bubblewrap to make the tests believe they can write to the root file system

Yep, a few issues here:

  • fakeroot doesn't present a new tree to child programs, it just makes it mutable.
  • bubblewrap can construct a new FS tree, however, the tree must contain all the userspace stuff needed for the program to run (shared libraries + loaders, CLI tools, /dev, ...).
  • There is no running away from mocking pacman and friends. Even if you could get pacman to run in the fake root, you wouldn't be able to run it on non-Arch distros (e.g. Travis CI).
  • Minimizing dependencies and maximizing execution speed are goals for the mocked test mode.
  • To some extent, mocking (and thus reimplementing) the coreutils stuff helps verify our assumptions about them. There were some cases where mismatches in behavior between integration and mocked modes revealed bugs in both the test suite and aconfmgr.

@joanbm (Contributor, Author) commented Apr 5, 2020

PR updated with your suggestions.

Also, good to know why the unit test mocks work the way they do; this is exactly the kind of feedback I wanted to get!

@CyberShadow (Owner):

Thank you!

The non-integration tests are failing on Travis. For some reason, it's not creating a status on the commit or PR.
https://travis-ci.org/github/CyberShadow/aconfmgr/jobs/671264120

Also, I just noticed that the test file is not named correctly. t-1_save-2_files-4_ignored_2_shellpatterns should be t-1_save-2_files-4_ignored-2_shellpatterns (dashes before numbers).

When updating the system configuration with aconfmgr save, if a large
number of ignored path patterns is used in the configuration (through
the IgnorePath helper function), the "Searching for lost files..."
phase can get considerably slow, even taking up most of the execution
time if I/O is fast (such as when using a fast SSD disk).

This happens because ignored path patterns are ultimately passed as
-wholename options to GNU find (the find implementation used in Arch),
and as an implementation detail of this tool, the search time seems to
grow linearly with the number of search patterns.

However, also as an implementation detail of this tool, the -regex
option scales much better, so passing the patterns using this option
is substantially faster. Of course, since IgnorePath accepts shell
patterns, we need a way to convert those to regular expressions.

A general shell pattern to regular expression conversion approach,
while technically possible, is very complex, since there are many
small details and edge cases where shell patterns and regular
expressions differ. However, in practice, the majority of IgnorePath
shell patterns we expect to find are 'simple' ones, made of literal
ASCII characters (e.g. alphanumeric, underscore, hyphen, slash)
plus the asterisk wildcard.

This commit adds some logic to detect which IgnorePath shell patterns
are 'simple' (as explained previously), and converts them to an
equivalent regular expression which is then provided to GNU find. This
notably speeds up the execution due to the workings of GNU find.
The non-'simple' patterns are still passed as shell patterns to find
like they were before. The net result is that the optimization is
applied transparently and should have no visible effect on the results.

This commit adds new unit tests that check the (already existing)
shell pattern behavior of IgnorePath. Those unit tests also hit the
new code paths introduced in the "Speed up lost files search for large
number of ignore paths" optimization, so it serves as an additional
validation for that code change.
@joanbm (Contributor, Author) commented Apr 5, 2020

Also, I just noticed that the test file is not named correctly.

Good catch, I've updated this for the PR.

The non-integration tests are failing on Travis.

Well... that's very awkward, but it just helped me find a huge problem with the PR. In this line from the previous revision of the PR:

if [[ "$ignore_path" =~ [^a-zA-Z0-9_/\ \-.*] ]]

It turns out that even with a backslash, a hyphen that's not at the start or end always behaves as a range ( https://stackoverflow.com/a/55377888 ), so this regex was matching more characters than expected. Due to the configuration of the Travis container, this also included '?', which is why the test failed.

Furthermore, after this I was a bit sceptical, so I did some fuzz testing, generating strings with this script, and found a further problem: a-z included non-ASCII characters like ä in some locales (I guess that's obvious in retrospect). While this also worked with the regexes (at least in the cases I found), it was definitely not the expected behaviour.

After I fixed the regex by enumerating each ASCII character instead of using a range, I could find no more problems by fuzzing and all tests now pass on my Travis container, so I've updated the PR with this.
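
For illustration, the shape of the fixed test (a sketch with hypothetical surrounding code): each allowed ASCII character is spelled out, and the hyphen sits at the end of the bracket expression so it is literal:

# No a-z/0-9 ranges anywhere, so locale-dependent collation cannot
# silently widen the whitelist; the trailing '-' is literal.
if [[ "$ignore_path" =~ [^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_/.\ *-] ]]
then
	complex_ignore_paths+=("$ignore_path")
else
	simple_ignore_paths+=("$ignore_path")
fi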

@coveralls:

Coverage Status

Coverage remained the same at 93.296% when pulling 4244334 on joanbm:master into 9deb121 on CyberShadow:master.

@CyberShadow (Owner) left a review:

Thank you very much again for this improvement and the thoroughness of your work.

@CyberShadow CyberShadow merged commit f93a796 into CyberShadow:master Apr 6, 2020
@CyberShadow (Owner):

For some reason, it's not creating a status on the commit or PR.

I unauthorized the Travis app from my GitHub account and logged in on its website, and it seems to be creating commit statuses again.

@joanbm (Contributor, Author) commented Apr 6, 2020

Thank you very much again for this improvement and the thoroughness of your work.

Thank you so much as well! This took more time than I expected to get right, so I appreciate all the time you've also spent to make this happen.
