New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libglobbing does not support escaped slashes #2297

Open
kodebach opened this Issue Nov 17, 2018 · 5 comments

Comments

Projects
None yet
2 participants
@kodebach
Contributor

kodebach commented Nov 17, 2018

Problem

Currently the new libglobbing does not support escaped slashes. This is because the library relies on, fnmatch(3) with the flags FNM_NOESCAPE and FNM_PATHNAME enabled.

Possible Solution

Instead of fnmatch we could use regex.h. This would require translating the globbing expressions into regular expressions. Additionally we would need to escape any of the regex special characters.

The regular expression ([^\\/]|(\\\\)*\\/|\\\\)+ should match a single key name part.
I tested the regex with this script (an additional \ has to be added because of the way sed expressions work):

cat <<'EOF' |
test/
other test/
more/test/123
adv\/anced/t\\\/est
another\\\\/test
EOF
sed -E 's/([^\\/]|(\\\\)*\\\/|(\\\\))+/{{\0}}/g'

Which prints:

{{test}}/
{{other test}}/
{{more}}/{{test}}/{{123}}
{{adv\/anced}}/{{t\\\/est}}
{{another\\\\}}/{{test}}
@markus2330

This comment has been minimized.

Contributor

markus2330 commented Nov 18, 2018

Thank you! Yes, I like the regex idea. It also has advantages for other tools (like a "mmv"-like functionality). And it is in general very useful for external tools if we can specify a regex.

The regex does not seem to be complete, though. Single backslashes are generally allowed (to be used literally) but in some situations they are used for escaping (., %), see also src/libs/elektra/keyname.c

cat <<'EOF' |
level1/\./level3
level1/\../level3
level1/\%/level3
level1/\x/level3
level1/x\x/level3
EOF
sed -E 's/([^\\/]|(\\\\)*\\\/|(\\\\))+/{{\0}}/g'

commented output:

{{level1}}/\{{.}}/{{level3}}  # ok
{{level1}}/\{{..}}/{{level3}}  # ok
{{level1}}/\{{%}}/{{level3}}  # ok
{{level1}}/\{{x}}/{{level3}}  # backslash should be part of key name part in level2
{{level1}}/{{x}}\{{x}}/{{level3}}  # should not introduce a new level
@kodebach

This comment has been minimized.

Contributor

kodebach commented Nov 18, 2018

I modified the regex. It now should always match a full key name part. The only escaping rule taken into account is the one concerned with slashes. Meaning the regex could be used as a substitute for * .

cat <<'EOF' |
test/
other test/
more/test/123
adv\/anced/t\\\/est
another\\\\/test
new\\/test/hello
level1/\./level3
level1/\../level3
level1/\%/level3
level1/\x/level3
level1/x\x/level3
level1/./level3
level1/../level3
level1/%/level3
level1/%level2%/level3
EOF
sed -E 's/(\\?[^\\/]|(\\\\)*(\\\/|\\\\))+/{{\0}}/g'

Commented outputs:

{{test}}/
{{other test}}/
{{more}}/{{test}}/{{123}}
{{adv\/anced}}/{{t\\\/est}}
{{another\\\\}}/{{test}}
{{new\\}}/{{test}}/{{hello}}
{{level1}}/{{\.}}/{{level3}}
{{level1}}/{{\..}}/{{level3}}
{{level1}}/{{\%}}/{{level3}}
{{level1}}/{{\x}}/{{level3}}
{{level1}}/{{x\x}}/{{level3}}
{{level1}}/{{.}}/{{level3}}  # canonically level1/level3 matched as {{level1}}/{{level3}}
{{level1}}/{{..}}/{{level3}} # canonically level3 matched as {{level3}}
{{level1}}/{{%}}/{{level3}}
{{level1}}/{{%level2%}}/{{level3}}

The cases of . and .. are not technically correct, but will not come up in real code, because matching is performed on canonical keys.
Whether or not empty parts % and placeholders like %level2% should have to be matched explicitly, I am not sure.

@markus2330

This comment has been minimized.

Contributor

markus2330 commented Nov 18, 2018

Thank you, looks much better. Yes, the goal for the regex is only to split into key names part, not to implement the semantics.

I wonder if we should provide "elektraKeyGlob" because with regexs this would be quite inefficient (compilation of regex for every key). With larger specifications, most of the execution time is already spent within the spec plugin matching key names, so we should make sure that we do not make the situation worse (I assume fnmatch is already highly optimized).

I'll say the lib is experimental for now and that anything can change in the next version.

@kodebach

This comment has been minimized.

Contributor

kodebach commented Nov 19, 2018

compilation of regex for every key

Yes the regex would be compiled for every call of elektraKeyGlob. But if you want to match against many keys you should really call elektraKsGlob, which could internally be optimised to only compile the regex pattern once.

I assume fnmatch is already highly optimized

According to this its more that POSIX regex.h is badly optimized.

Which is why I now think, using regex without allowing regex input from the user might not be a good idea. However, for mmv-like functionality which needs back references, using PCRE might be worth thinking about. PCRE with its many extensions, would also allow direct translation of _ and # into regex (using recursion and negative lookahead), although this would require extensive testing.

With larger specifications, most of the execution time is already spent within the spec plugin matching key names

We could always provide a second version elektraKeyGlobFast, which restricts the usage of *. Use of * would only be allowed in one of three ways:

  • as a whole key part, i.e. key part can be anything
  • ONCE anywhere inside a key part, i.e. key part must have prefix, suffix or both
  • at the start AND end of a key part, i.e. key part must contain substring
    This behaviour would a simple hand-written implementation possible, it would also allow further optimisations for KeySet.

For the spec plugin, this would not be a problem because * is an undocumented feature anyway.

Additionally elektraKsGlobFast could use a trie-like structure on the basis of key name parts, which would be compiled first (for reuse with different patterns). This way not every key in the KeySet has to be checked.

For example when the KeySet contains abc/def, abc/jkl and test/k?y. The trie structure would be:

                  +
                  |
          +--------------+
          +              +
         abc            test
          +              +
     +----+----+         |
     +         +         +
    def       jkl       key

When matching against the pattern test/* we do:

  1. search test in sorted of children list { abc, test}
  2. go to sub-tree of test, ignore everything else
  3. recursively call on every child of test, because of *
  4. k?y find sub-keys matching pattern and call recursively
  5. add the Key test/key to the result (because it is a leaf)

Here the restriction of * provides significant advantages, because it makes matching easier.
If the whole key part is * every child matches (_ and # are similar if we store separate child lists at construction). And as long as a prefix of the key part is known, we can perform a binary search for the first child matching that prefix and the check the following children. Only if no prefix is known, we need to look at every child.

@markus2330

This comment has been minimized.

Contributor

markus2330 commented Nov 21, 2018

would be compiled for every call of elektraKeyGlob

The question is if we should provide such a function then.

According to this its more that POSIX regex.h is badly optimized.

I do not know about which "POSIX" implementation they refer to. They seem to use "Microsoft Visual C++", maybe it is only a problem of this specific implementation.

PCRE

I think the musl-systems will not so happy about the dependency. Because libglob is a quite basic library hopefully used by basic plugins (like spec) in the future. But as it is not in the core, we actually could use PCRE if there is no way around (e.g. because of performance reasons.)

although this would require extensive testing.

Maybe it is better then to use a more simple and not-so-direct regex?

elektraKsGlobFast

I would very much prefer to have only one API and implementation. Otherwise it is too difficult to maintain. "kdb mmv" is nice to have and should not influence elektraKsGlob too much. In particular it is not good if we leak too many internals. It should not matter if internally regex, fnmatch or a trie is used. Such implementation details are up to you anyway. If we use a trie, we might be able to reuse the existing trie. I think, however, that a trie implementation might be quite cumbersome (even with restrictions). For me glob -> regex seems to be the easier way to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment