Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--multiline with --replace causes duplicate output #2095

Closed
balupton opened this issue Dec 2, 2021 · 4 comments · Fixed by #2209
Closed

--multiline with --replace causes duplicate output #2095

balupton opened this issue Dec 2, 2021 · 4 comments · Fixed by #2209
Labels
bug A bug.

Comments

@balupton
Copy link

balupton commented Dec 2, 2021

What version of ripgrep are you using?

ripgrep 13.0.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

cargo

What operating system are you using ripgrep on?

Darwin redacted 21.1.0 Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64 x86_64

Describe your bug.

cat <<EOF > test.bash
#!/usr/bin/env bash

zero=one

a=one

if true; then
	a=(
		a
		b
		c
	)
	true
fi

a=two

b=one
EOF

# all of these are the same


rg  --multiline --only-matching '^(?P<indent>\s*)a=(?P<value>(?ms:[(].*?[)])|.*?)$' --replace '${value}' ./test.bash

rg  --multiline --only-matching '^(?P<indent>\s*)a=(?P<value>[(](?ms:.*?)[)]|.*?)$' --replace '${value}'  test.bash

rg  --multiline --only-matching '^(?P<indent>\s*)a=(?P<value>[(](?ms:.*?)[)]|[^(].*?)$' --replace '${value}'  test.bash

rg  --multiline --only-matching '^(?P<indent>\s*)a=(?P<value>[(](?ms:.*?)[)]|[^(](?-ms:.*?))$' --replace '${value}' test.bash

rg  --multiline --only-matching '^(?-ms:(?P<indent>\s*)a=(?P<value>[(](?ms:.*?)[)]|[^(].*?))$' --replace '${value}' test.bash

rg  --multiline --only-matching '(?-ms:^(?P<indent>\s*)a=(?P<value>[(](?ms:.*?)[)]|[^(].*?)$)' --replace '${value}' test.bash
# ^ this fails, rg really doesn't like flag modifiers prior to ^ and $

rg --multiline --only-matching '(?ms:^(?P<indent>\s*)a=(?P<value>[(].*?[)]|[^\n]*)$)' --replace '${value}' ./test.bash

rg --multiline --only-matching '(?ms:^(?P<indent>\s*)a=(?P<value>[(].*?[)]|(?-ms:.*))$)' --replace '${value}' ./test.bash

rg --multiline --only-matching '(?m:^(?P<indent>\s*)a=(?P<value>[(](?s:.*?)[)]|.*)$)' --replace '${value}' ./test.bash

outputs:

4:one
7:(
8:		q
9:		r
10:		s
11:	)
14:two
8:(
9:		q
10:		r
11:		s
12:	)
15:two
15:two

Instead of:

4:one
7:(
8:		q
9:		r
10:		s
11:	)
14:two

What are the steps to reproduce the behavior?

see above

What is the actual behavior?

see above

What is the expected behavior?

see above

What do you think ripgrep should have done?

It would be good if the --multiline handling could be disabled, in favour of the rust regex localised flag applied (?ms:

@balupton
Copy link
Author

balupton commented Dec 2, 2021

Without the --replace, it works fine, no duplicates, but returns the entire match:

rg --multiline --only-matching '(?ms:^(?P<indent>\s*)a=(?P<value>[(].*?[)]|[^\n]*)$)'  ./test.bash

rg --multiline --only-matching '(?ms:^(?P<indent>\s*)a=(?P<value>[(].*?[)]|(?-ms:.*))$)'  ./test.bash

rg --multiline --only-matching '(?m:^(?P<indent>\s*)a=(?P<value>[(](?s:.*?)[)]|.*)$)' ./test.bash

output:

5:a=one
8:	a=(
9:		a
10:		b
11:		c
12:	)
16:a=two

@balupton balupton changed the title --multiline is causing duplicate output --multiline with --replace causes duplicate output Dec 2, 2021
@balupton
Copy link
Author

balupton commented Dec 2, 2021

For the meantime, --max-count=1 is a suitable workaround.

@BurntSushi BurntSushi added the bug A bug. label Dec 2, 2021
@BurntSushi
Copy link
Owner

Yup, looks like a bug to me. Thanks for the report!

@ghost
Copy link

ghost commented Feb 23, 2022

I ran into this problem when working on a unified diff printer. It seems the replacement bytes retain the remainder of the subject buffer instead of only until the match end. I tried to fix it and it worked, but I'm not sure if my fix is a proper one.
See https://github.com/BurntSushi/ripgrep/pull/2149/files#diff-cbced35c4a7b8ae41aa579dd0395a63152e23d6edf91d2e1035f189d1d3337da

BurntSushi added a commit that referenced this issue May 11, 2022
This furthers our kludge of dealing with PCRE2's look-around in the
printer. Because of our bad abstraction boundaries, we added a kludge to
deal with PCRE2 look-around by extending the bytes we search by a fixed
amount to hopefully permit any look-around to operate. But because of
that kludge, we wind up over extending ourselves in some cases and
dragging along those extra bytes.

We had fixed this for simple searching by simply rejecting any matches
past the end point. But we didn't do the same for replacements. So this
commit extends our kludge to replacements.

Thanks to @sonohgong for diagnosing the problem and proposing a fix. I
mostly went with their solution, but adding the new replacement routine
as an internal helper rather than a new APIn in the 'grep-matcher'
crate.

Fixes #2095, Fixes #2208
BurntSushi added a commit that referenced this issue May 11, 2022
This furthers our kludge of dealing with PCRE2's look-around in the
printer. Because of our bad abstraction boundaries, we added a kludge to
deal with PCRE2 look-around by extending the bytes we search by a fixed
amount to hopefully permit any look-around to operate. But because of
that kludge, we wind up over extending ourselves in some cases and
dragging along those extra bytes.

We had fixed this for simple searching by simply rejecting any matches
past the end point. But we didn't do the same for replacements. So this
commit extends our kludge to replacements.

Thanks to @sonohgong for diagnosing the problem and proposing a fix. I
mostly went with their solution, but adding the new replacement routine
as an internal helper rather than a new APIn in the 'grep-matcher'
crate.

Fixes #2095, Fixes #2208
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
2 participants