Speed up lost files search for large number of ignore paths #74
Conversation
Thank you for another well thought out and thorough pull request.
This is fantastic, and I'm excited about this optimization. However, I think a switch is not the best way to activate it. Pedantically:
Practically:
What do you think?
Oops, I forgot about escaping regular expression special characters. I think we can test correctness with some brute-force fuzzing to catch remaining bugs.
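A brute-force check along those lines could look like the following sketch (illustrative only, not any script actually used in this PR; the character set, iteration count, and the sed conversion rule are assumptions mirroring what is discussed later in this thread). It generates random "simple" patterns and paths, converts each pattern to a regex, and checks that bash's glob matching and grep agree:

```shell
#!/bin/bash
# Illustrative fuzzing sketch: for random strings built from 'simple'
# characters, check that bash's glob matching and the sed-converted regex
# agree on whether a candidate path matches.
chars='ab_/.*-'
mismatches=0
for _ in $(seq 1 200)
do
	pat=''
	path=''
	for _ in $(seq 1 $((RANDOM % 6 + 1))); do pat+=${chars:RANDOM%${#chars}:1}; done
	for _ in $(seq 1 $((RANDOM % 6 + 1))); do path+=${chars:RANDOM%${#chars}:1}; done
	# Convert the shell pattern to a regex (rule discussed later in this PR)
	regex=$(printf '%s' "$pat" | sed 's|[^*a-zA-Z0-9_/ ]|[&]|g; s|\*|.*|g')
	if [[ "$path" == $pat ]]; then glob_match=y; else glob_match=n; fi
	if printf '%s\n' "$path" | grep -qx -- "$regex"; then re_match=y; else re_match=n; fi
	if [ "$glob_match" != "$re_match" ]
	then
		echo "mismatch: pattern '$pat' vs path '$path'"
		mismatches=$((mismatches + 1))
	fi
done
echo "$mismatches mismatches found"
```

Any disagreement between the two matchers points at a conversion bug of the kind mentioned above.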
Pretty much. I haven't looked at how it's implemented in GNU find, but judging from the behaviour, it looks like the filters are simply checked in succession, with no optimization at all, while for regex they use a library which, I guess, optimizes common prefixes and so on. So my first guess is that making this case faster in GNU find itself would be a pretty huge task, involving some kind of optimizer for non-regex queries.
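For illustration, here is a hedged sketch of the two find invocations being compared (the directory layout and names are made up, and it only checks that the two forms select the same files, not their relative speed; GNU find's default regex flavour uses `\|` for alternation):

```shell
#!/bin/bash
# Sketch: a chain of -wholename shell-pattern tests, checked in succession,
# vs. a single -regex alternation matched by find's regex library.
set -eu
dir=$(mktemp -d)
mkdir -p "$dir/keep" "$dir/skipa" "$dir/skipb"
touch "$dir/keep/f" "$dir/skipa/f" "$dir/skipb/f"

# One shell pattern per ignored path:
with_wholename=$(find "$dir" -type f \
	-not \( -wholename "$dir/skipa/*" -o -wholename "$dir/skipb/*" \) | sort)

# All ignored paths combined into one regular expression:
with_regex=$(find "$dir" -type f \
	-not -regex "$dir/skipa/.*\|$dir/skipb/.*" | sort)

[ "$with_wholename" = "$with_regex" ] && echo "identical results"
rm -r "$dir"
```

The speedup discussed in this PR comes from the second form scaling much better as the number of ignored patterns grows.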
If the idea of actually supporting all patterns doesn't pan out, this definitely sounds like a better way to go about this, rather than a new switch (not sure why it didn't even cross my mind).
I had in mind that shell patterns were pretty complex, but with some luck the rules find supports (which at the end of the day are the ones we support nowadays) will be simple enough. I will take a further look to see what the actual rules are, and whether something is missing in your implementation. And as you said, fuzzing, or at least testing a lot of cases, is definitely the way to go to verify this. (And thanks for putting in the time for the bash implementation, that will definitely be useful as a base to work on!)
GNU find is in findutils, which is a dependency of base, so I think we can safely assume it's there. I think we are already using non-standard extensions like
I don't think this is really something someone would typically hit (and, as you said, it's more likely to be hit due to misuse), but I'll keep this in mind.
UPDATE: It looks like GNU find just translates
I took a look at it and found a good test suite in glibc-2.31. However, this seems to reveal some more complexity that's not handled by the bash implementation, like:
Here's the test script: https://gist.github.com/joanbm/a9277f30a2a1c8a50454ede43065544c . I don't rule out that some problems are due to the way I'm testing. My general thought is that since getting the implementation right is tricky, and even if doable the code will probably be rather ugly, I'd rather go with the simpler "whitelist" approach, where we check whether ignore expressions are 'simple' (non-special characters, asterisk wildcards, maybe also question mark wildcards) and transform them to regex if so. An important thing to note here is that we can implement it so that it's not an "all-or-nothing" thing: if the user has 200 simple ignore patterns and 10 crazy ignore patterns, we can 'upgrade' just the 200 simple ones. What do you think?
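The proposed split could be sketched roughly like this (the variable names and pattern list are made up for illustration; note the `-` is placed last in the bracket expression so it stays literal rather than forming a range):

```shell
#!/bin/bash
# Sketch of the non-all-or-nothing split: 'simple' patterns can be upgraded
# to a combined regex, while the rest keep being passed as shell patterns.
ignore_paths=('/var/cache/*' '/srv/data' '/weird/[ab]?path')
simple=()
complex=()
for p in "${ignore_paths[@]}"
do
	if [[ "$p" =~ [^a-zA-Z0-9_/.*-] ]]
	then
		complex+=("$p")   # contains a special character: keep as shell pattern
	else
		simple+=("$p")    # only simple characters and '*': upgrade to regex
	fi
done
echo "simple: ${#simple[@]}, complex: ${#complex[@]}"
```

With this shape, a configuration with mostly simple patterns gets most of the speedup even if a few exotic patterns remain.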
Good finds, I missed these.
I think this is not an issue for our use case. I don't think Arch Linux can be made to work with another libc.
This is probably the only one that's not solvable by hammering at the bash implementation for a bit.
Sounds great!
I've updated the PR with an implementation of your idea of automatically detecting and upgrading simple patterns to regex, so the switch is no longer necessary. I'm quite satisfied with the final result, which, while still conceptually kind of a hack due to the abuse of GNU find's implementation behaviour, is self-contained, and the code is relatively short and understandable. Some notes:
BTW, not really directly related to this PR, but I've been wondering whether we could avoid all the "path rewriting" mocks in the unit tests by using something like
Great stuff!
- So splitting the regex by `MAX_ARG_STRLEN` is useless unless this is resolved.
Argh, yes, of course. `aconfmgr` does support running as the root user, which is useful for setting up a new system, but as it's unlikely that execution mode is going to be the only one under which a configuration is expected to work, it's not worth breaking a sweat over. There is also a `sudoserver` branch which may lift limitations imposed by `sudo`, but I don't know if I'm going to finish / merge it.
src/common.bash
Outdated
```bash
# However, this is painfully slow if the number of ignore patterns is large
# Instead, this function (ab)uses the fact that if the number of patterns
# given to GNU find is big, then as an implementation detail, using regular
# expressions (-regex option and similar) scale smuch better than using
```
Typo:
```diff
-# expressions (-regex option and similar) scale smuch better than using
+# expressions (-regex option and similar) scales much better than using
```
Will update.
src/common.bash
Outdated
```bash
@@ -153,6 +153,51 @@ function AconfCompileOutput() {
skip_inspection=n
skip_checksums=n

# Given a list of paths to ignore (which can contain shell patterns),
# creates the list of arguments to give to 'find' to ignore them
```
```diff
-# creates the list of arguments to give to 'find' to ignore them
+# creates the list of arguments to give to 'find' to ignore them.
+# The result is stored in the global variable "ignore_args".
```
There is a way to "cleanly" return an array from a bash function, without creating global variables or using pipes:
```bash
#!/bin/bash
set -eEuo pipefail

function Callee() {
	local varname=$1
	local -n var=$varname
	# shellcheck disable=SC2034
	var=(hello world)
}

function Caller() {
	local -a hello
	Callee hello
	printf '%s ' "${hello[@]}"
	printf '\n'
}

Caller
echo "$hello" # Unbound, not a global!
```
I didn't know about it when I wrote aconfmgr, which is why some places are written using less clean approaches.
Nice, I didn't know that (I'm not too deep into Bash). Will update.
src/common.bash
Outdated
```bash
local ignore_path
for ignore_path in "${ignore_paths[@]}"
do
	if [[ "$ignore_path" =~ [^a-zA-Z0-9_/\ \-.*] ]]; then
```
Nit, the code style in aconfmgr is to use line breaks instead of semicolons (shell equivalent of Allman braces).
Will update.
src/common.bash
Outdated
```bash
# Converting the simple ignore paths to regular expressions just needs to
# handle the asterisk wildcard, and escape a few characters
local ignore_regexps
mapfile -t ignore_regexps < \
```
Pipe into `mapfile` instead of using process substitution. We have `lastpipe` on, so that will work and save a subshell.
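A minimal demo of this suggestion (a sketch, assuming a non-interactive script where job control is off, which is when `lastpipe` takes effect):

```shell
#!/bin/bash
# With lastpipe set, the last pipeline element runs in the current shell,
# so mapfile can fill an array without a process substitution.
shopt -s lastpipe
printf '%s\n' one two three | mapfile -t lines
echo "${#lines[@]} lines read"
```

Without `lastpipe`, `mapfile` would run in a subshell and the array would be lost when the pipeline finishes.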
Will update.
src/common.bash
Outdated
```bash
local ignore_regexps
mapfile -t ignore_regexps < \
	<( IFS=$'\n' ; echo -n "${simple_ignore_paths[*]}" | \
	sed 's|[^*a-zA-Z0-9_/ ]|[&]|g; s|\*|.*|g' )
```
`sed -z` + `printf '%s\0'` to allow (really) weird characters like newlines?
Doesn't matter, because `simple_ignore_paths` doesn't contain any weird characters by definition (see the regex used to build it a few lines before).
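For reference, the quoted sed conversion rule can be exercised on its own; the sample path here is made up:

```shell
# Bracket-escape everything outside the 'simple' set, then turn '*' into '.*':
printf '%s' '/etc/ssl/*.pem' | sed 's|[^*a-zA-Z0-9_/ ]|[&]|g; s|\*|.*|g'
```

This prints `/etc/ssl/.*[.]pem`: the `.` is neutralized by wrapping it in a bracket expression, and the shell `*` becomes the regex `.*`.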
test/t/lib-funcs-common.bash
Outdated
```bash
@@ -102,6 +102,9 @@ test_globals_whitelist=(
	system_dir
	tmp_dir

	# Internal state (AconfCreateFindIgnoreArgs)
	ignore_args
```
In case `local -n` doesn't work out, I think it would be better to simply unset it once it's no longer needed, as this variable has a finite lifetime anyway.
`local -n` works, so I will just remove this.
test/t/lib-mocks.bash
Outdated
```bash
local regex_escaped_dir=$(echo -n "$test_data_dir"/files | \
	sed 's/./[&]/g')
```
```diff
-local regex_escaped_dir=$(echo -n "$test_data_dir"/files | \
-	sed 's/./[&]/g')
+local regex_escaped_dir=$(sed 's/./[&]/g' <<< "$test_data_dir"/files)
```
Will update.
test/t/lib-mocks.bash
Outdated
```bash
local regex_escaped_dir=$(echo -n "$test_data_dir"/files | \
	sed 's/./[&]/g')
args+=("$regex_escaped_dir($arg)")
```
```diff
-args+=("$regex_escaped_dir($arg)")
+args+=("$regex_escaped_dir($arg)") # Group |-delimited patterns
```
Will update.
Yep, a few issues here:
PR updated with your suggestions. Also, good to know why the unit test mocks work the way they do; this is exactly the kind of feedback I wanted to get!
Thank you! The non-integration tests are failing on Travis. For some reason, it's not creating a status on the commit or PR. Also, I just noticed that the test file is not named correctly.
When updating the system configuration with aconfmgr save, if a large number of ignored path patterns is used in the configuration (through the IgnorePath helper function), the "Searching for lost files..." phase can get considerably slow, even taking up most of the execution time if I/O is fast (such as when using a fast SSD).

This happens because ignored path patterns are ultimately passed as -wholename options to GNU find (the find implementation used in Arch), and as an implementation detail of this tool, the search time appears to grow linearly with the number of search patterns. However, also as an implementation detail, the -regex option scales much better, so passing the patterns through it is substantially faster.

Of course, since IgnorePath accepts shell patterns, we need a way to convert those to regular expressions. A general shell-pattern-to-regular-expression conversion, while technically possible, is very complex, since there are many small details and edge cases where shell patterns and regular expressions differ. In practice, however, the majority of IgnorePath shell patterns we expect to find are 'simple' ones, made of literal ASCII characters (e.g. alphanumeric, underscore, hyphen, slash) plus the asterisk wildcard.

This commit adds some logic to detect which IgnorePath shell patterns are 'simple' (as explained above) and converts them to an equivalent regular expression, which is then given to GNU find. This notably speeds up the execution due to the workings of GNU find. The non-'simple' patterns are still passed to find as shell patterns, as before. The net result is that the optimization is applied transparently and should have no visible effect on the results.
This commit adds new unit tests that check the (already existing) shell pattern behavior of IgnorePath. Those unit tests also hit the new code paths introduced by the "Speed up lost files search for large number of ignore paths" optimization, so they serve as additional validation of that code change.
Good catch, I've updated this for the PR.
Well... that's very awkward, but it just helped me find a huge problem with the PR. In this line in the previous PR:
It turns out that even with a backslash, a hyphen that's not at the start or end of a bracket expression always behaves as a range ( https://stackoverflow.com/a/55377888 ), so this regex was including more characters than expected. Due to the configuration of the Travis container, this also included '?', which is why the test failed. Furthermore, after this I was a bit sceptical, so I did some fuzz testing, generated strings with this script, and found a further problem. After I fixed the regex by enumerating each ASCII character instead of using a range, I could find no more problems by fuzzing, and all tests now pass on my Travis container, so I've updated the PR with this.
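A small illustration of the bracket-expression pitfall (a sketch, assuming glibc's grep in the C locale; inside `[...]` a backslash is an ordinary character, not an escape):

```shell
# '[a\-z]' means: 'a', plus the byte range from '\' (0x5C) to 'z' (0x7A),
# which contains characters like '^' (0x5E) -- so this prints 1:
printf '^\n' | LC_ALL=C grep -c '[a\-z]'
# With the hyphen placed first it is literal, and '^' no longer matches,
# so this prints 0:
printf '^\n' | LC_ALL=C grep -c '[-az]' || true
```

In other locales, ranges follow collation order rather than byte order, which is how '?' can unexpectedly slip into such a range on a differently configured machine.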
Thank you very much again for this improvement and the thoroughness of your work.
I revoked the Travis app's authorization from my GitHub account and logged in on its website again, and it seems to be creating commit statuses again.
Thank you so much as well! This took more time than I expected to get right, so I appreciate all the time you've also spent to make this happen.
This PR implements a new option that can make the "Searching for lost files..." phase during `aconfmgr save` much faster if a large number of ignored path patterns (`IgnorePath`) is used.

The problem with using many ignored path patterns is that they end up being passed as a large list of `-wholename` conditions to find (which in Arch Linux is GNU find), which causes the execution time of find (and hence of the entire phase) to scale with the number of ignored path patterns. This can end up taking the majority of the execution time. For example, I currently have 270 `IgnorePath` patterns in my configuration, and while a typical run with `--skip-checksums` but without the new option in this PR takes 35 seconds, the same run with the new option takes 15 seconds.

The way this speedup is implemented basically abuses implementation details of GNU find and has some important limitations, so honestly, feel free to reject this PR if you don't want to have this junk in aconfmgr. Basically, the `-regex` option of GNU find scales much better with an increasing number of patterns (since it uses a dedicated regex library), so that's what this PR uses. The problem, of course, is that `-wholename` takes a shell pattern while `-regex` takes a regular expression, so we need to convert one into the other. This is pretty complex and not really a job for aconfmgr, so I've only implemented asterisk wildcards, which is sufficient for all my use cases.

The new option is called `--quick-lostfiles` and changes the lost file search behaviour in `aconfmgr save` to use the regex trick I mentioned earlier. Since it has the limitation I mentioned about only accepting asterisk wildcards, it isn't (and can't be) enabled by default.

As an additional reference, I also sent a similar PR to lostfiles ( graysky2/lostfiles#42 ), but it got stuck in limbo.