Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFE] block matching in ugrep #369

Closed
trantor opened this issue Mar 9, 2024 · 18 comments
Closed

[RFE] block matching in ugrep #369

trantor opened this issue Mar 9, 2024 · 18 comments
Labels
question A question that has or needs further clarification works as documented Works (with a minor adjustment) as stated in the documentation

Comments

@trantor
Copy link
Contributor

trantor commented Mar 9, 2024

Hello hello.
I've been wondering whether it could be feasible to implement a block-level or paragraph match mode for ugrep.
There are a few implementations in the wild doing just that, but they usually are dedicated tools.

By block-level I mean a tool which would match on a block of contiguous lines defined by regex patterns matching one a start line and another an end line of the block.
It would be interesting in order to allow, for instance, to match and return only matching blocks or only non-matching blocks against a given pattern.

Thanks in advance.

@genivia-inc genivia-inc added the question A question that has or needs further clarification label Mar 10, 2024
@genivia-inc
Copy link
Member

genivia-inc commented Mar 10, 2024

This works out-of-the-box with ugrep with lazy quantifiers and \n newline matching. The block of text between two matching lines is matched by the lazy pattern (.*\n)*?, which is followed by the ending line pattern to end the lazily-matching block.

Given a BEGIN and an END pattern to match the begin and end line of the block to match:

$ ug 'BEGIN.*\n(.*\n)*?.*END' FILE

No new features are necessary. But perhaps the syntax is a bit off-putting, although it is short and only requires placing the pattern .*\n(.*\n)*?.* between BEGIN and END.

To match a BEGIN and an END pattern also on the same line as well as a block:

$ ug 'BEGIN(.*\n)*?.*END' FILE

@genivia-inc genivia-inc added the works as documented Works (with a minor adjustment) as stated in the documentation label Mar 10, 2024
@genivia-inc
Copy link
Member

For example, to block-match C++ /*-*/ block comments look for regex /\* then match text with lazy regex (.*\n)*? and end with the regex .*\*+\/:

% ug '/\*(.*\n)*?.*\*+\/' file.cpp

@trantor
Copy link
Contributor Author

trantor commented Mar 10, 2024

Thanks @genivia-inc . Just an additional question at this point. What if I wanted to invert the match and output only the blocks that don't match a specific pattern but do match the BEGIN and END criteria?
Would that require a 2-step process, one where I first match blocks using the BEGIN and END regexes as start and end of the matching and a second one where I invert the match and use BEGIN and END patterns with the matching pattern inbetween them?
(I hope I was clear enough.)

@genivia-inc
Copy link
Member

genivia-inc commented Mar 10, 2024

Option -v inverts matches. But that won't include the begin and end line, which is what you want.

To include begin and end lines, it will be necessary to use option -P with three patterns:

  1. from the start of the file anchor \A to a line that matches BEGIN: \A(.*\n)*?.*BEGIN
  2. from a line that matches END to a line that matches BEGIN, i.e. inverted order: END.*\n(.*\n)*?.*BEGIN
  3. from a line that matches END to the end of the file anchor \Z: END(.*\n)*\Z
$ ug -P -e '\A(.*\n)*?.*BEGIN' -e 'END.*\n(.*\n)*?.*BEGIN' -e 'END(.*\n)*\Z' FILE

where the third pattern doesn't need to use a lazy quantifier since it matches greedily to the end of the file. Because of this lazy/non-lazy "conflict" at a zero-width anchor \Z, it's best to combine the second and third patterns:

$ ug -P -e '\A(.*\n)*?.*BEGIN' -e 'END.*\n(.*\n)*?(.*BEGIN|\Z)' FILE

All of this assumes that BEGIN ... END isn't nested. Nested blocks cannot be handled with regex, because it requires a context-free grammar, not a linear grammar. To do so requires a parser.

@genivia-inc
Copy link
Member

genivia-inc commented Mar 11, 2024

I've updated the previous post, because anchors \A and \Z require option -P for Perl regex (with PCRE2) for this to work. I forgot to mention this. Otherwise, with POSIX lazy matching and anchors the matches are too long.

I'll take a look at the anchors with POSIX lazy quantifiers (without option -P), which in principle should work find with POSIX regex matching. But it seems that the greediness of my POSIX regex engine matches too much when anchors are used.

@genivia-inc
Copy link
Member

OK. I've made a few minor changes to permit the use of anchors such as \A and \Z with lazy patterns, so -P is not necessary to accomplish the same here.

Will release as update 5.1.1 soon.

@genivia-inc
Copy link
Member

genivia-inc commented Mar 11, 2024

Fixed. It appeared to be a minor problem with the DFA construction algorithm that was updated, causing anchors combined with lazy quantifiers to become too greedy. I have a set of unit tests. But tests are not 100% covering combinations like this one.

@genivia-inc
Copy link
Member

I'm adding this issue to the Discussions tab, so folks who have the same question can find an answer (that is not closed).

@vt-alt
Copy link

vt-alt commented Apr 15, 2024

Is it possible to print a whole paragraph (text separated by empty lines) if paragraph contains some match?

For example I get 3 matches with awk:

ugrep (master)$ git log | awk '/released 4.5/' RS= ORS='\n\n' | cat -A
    released 4.5.2$
    $
    - PR #344$
    - PR #343$
    - fix 7zip search on 32 bit systems when option -M (or -t setting -M) is used$
$
    released 4.5.1 fix bzip3/7zip configure interference$
    $
    Reported here:  https://github.com/Genivia/ugrep-indexer/issues/9$
$
    released 4.5.0$
    $
    - add 7zip archive search with option -z, fix #185$
    - apply Unicode normalization to combining characters in regex patterns, fix #298$
    - improved TUI TAB directory navigation when searching from the FS root$
    - updated ugrep.exe option -P to use PCRE2 10.42$
$

(Added cat -A so it visible it is actually 3 paragraphs.)

I heard that some old grep implementations had -p option for this.

@genivia-inc
Copy link
Member

Never heard of the -p option, so that's an interesting historical note. In BSD grep option -p works as follows: "If -R is specified, no symbolic links are followed. This is the default." so it's definitely not something universal.

Use .*PATTERN(.|\n)*?\n\n to match a bunch of lines that end with an empty line. This uses the lazy quantifier (.\n)*? to match anything, including newlines up to the \n\n that follows the lazy pattern. The second \n will however also trigger a match for the line that follows, which is also shown as a result. This can be suppressed with option -o:

$ ug -o '.*PATTERN(.|\n)*?\n\n' file.txt

@vt-alt
Copy link

vt-alt commented Apr 16, 2024

Thanks for the answer!

I just remembered this -p was from AIX: https://www.ibm.com/docs/fr/aix/7.1?topic=g-grep-command

@vt-alt
Copy link

vt-alt commented Apr 16, 2024

Well, starting with .*PATTERN... it only prints the first matching line of paragraph until its end, not the whole paragraph (like with awk example). Example

$ git log | ug '.*ugrep-indexer/issues(.|\n)*?\n\n' | cat -A
    Reported here:  https://github.com/Genivia/ugrep-indexer/issues/9$
$
commit 7f18bb90c1b548e0749d660d27d1c501ed80f8a4$

Also, this commit line seems extraneous. (With -o it disappears).

@genivia-inc
Copy link
Member

Right, I had mentioned option -o. The reason is that the second \n in \n\n no longer separates the match from the "extra" line that is added to the result. That's because \n separates lines. Matching \n also includes what comes after.

$ ug -o '(\n.+)*PATTERN(.|\n)*?\n\n' file.txt

Other patterns are possible, but some may cause backtracking that impacts performance. Starting with an \n gets results quicker (caveat: but it won't include the first line of a file when that paragraph matches)

@genivia-inc
Copy link
Member

To avoid the extra line without -o, a lookahead can be used but this is lookahead is not documented because it can only be used at the end of a pattern, not inside like Perl option -P allows:

$ ug '(\n.+)*.*PATTERN(.|\n)*?\n(?=\n)' file.txt

@vt-alt
Copy link

vt-alt commented Apr 16, 2024

Matching \n also includes what comes after.

That's non-intuitive. Also, this approach (as in the example) misses first and last paragraphs (because there could be no \n). So the task becomes kinda complicated.

Just thinking: maybe it would be cool to have additional pattern like -o that will be matched and printed IF second (-e) pattern matches inside of it.

@genivia-inc
Copy link
Member

genivia-inc commented Apr 16, 2024

It is important to consider the fact that ugrep like other grep is not a tool like awk that separates input into records and records can be separated into fields.

Rather, ugrep is a (multi)line-oriented search tool. It does this efficiently in a streaming way, i.e. by not storing an entire file in memory. That is critical when searching recursively to not bog down a machine when files are large. In fact, ugrep has no limit on file size to search.

@genivia-inc
Copy link
Member

genivia-inc commented Apr 16, 2024

Matching \n also includes what comes after.

That's non-intuitive. Also, this approach (as in the example) misses first and last paragraphs (because there could be no \n). So the task becomes kinda complicated.

No, it is not. Let me explain why. It is how the default matching works, i.e. anything that matches on a line results in outputting that entire line (unless option -o is used.) So when matching an \n we increment the line counter to the next line because from that point on we are on the next line already.

For example, say the input is:

abc\ndef\n

then matching c or c$ gives abc and matching c\nd gives abc\ndef and matching c\n or \nd also gives abc\ndef. The \n is just another character when matched.

You're probably thinking of \n\n as a record separator that is an empty line. But for grepping it isn't. It's just a sequence of characters to match and show in which lines it occurred.

@vt-alt
Copy link

vt-alt commented Apr 16, 2024

I meant "non-intuitive" that \n is (somehow) appearing at the start of the line (so grepping for it also matches a full following line).

This is slightly different from ripgrep:

$ echo -e 'a\nb\nc\n' | ugrep -e 'b\n'
b
c
$ echo -e 'a\nb\nc\n' | rg -U -e 'b\n'
b
$ 

But when prepending \n they curiously behave the same:

$ echo -e 'a\nb\nc\n' | ugrep -e '\nb'
a
b
$ echo -e 'a\nb\nc\n' | rg -U -e '\nb'
a
b
$ 

Judging from this, for ripgrep, it seems, that \n is always at the end of the line (which is intuitive), but for ugrep it seems "something-not-at-this-line" and thus matching next/previous lines (or an 'non-existent' empty line for end of file).

I am not criticizing your approach, though, it's just curious observation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question A question that has or needs further clarification works as documented Works (with a minor adjustment) as stated in the documentation
Projects
None yet
Development

No branches or pull requests

3 participants