Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--line-regexp no longer works with --null-data after upgrading to 14.0.1 #2658

Closed
1 task done
Frederick888 opened this issue Nov 27, 2023 · 1 comment
Closed
1 task done
Labels
rollup A PR that has been merged with many others in a rollup.

Comments

@Frederick888
Copy link

Frederick888 commented Nov 27, 2023

Please tick this box to confirm you have reviewed the above.

  • I have a different issue.

What version of ripgrep are you using?

ripgrep 14.0.1

features:-simd-accel,+pcre2
simd(compile):+SSE2,-SSSE3,-AVX2
simd(runtime):+SSE2,+SSSE3,+AVX2

PCRE2 10.42 is available (JIT is available)

How did you install ripgrep?

Arch Linux official repository.

What operating system are you using ripgrep on?

Arch Linux x86_64 (rolling)

Describe your bug.

Previously using ripgrep 13.0.0, when --null-data was given, ripgrep would basically treat NUL the same as an end of a line, and --line-regexp could be used to match a whole line.

After upgrading to 14.0.1, it no longer matches.

What are the steps to reproduce the behavior?

printf '%s\0' 'foo' 'bar' 'hello' 'world' | rg --null-data --line-regexp 'foo'

What is the actual behavior?

It prints nothing to stdout using ripgrep 14.0.1.

DEBUG|rg::flags::parse|crates/core/flags/parse.rs:97: no extra arguments found from configuration file
DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1093: using heuristics to determine whether to read from stdin or search ./ (is_readable_stdin=true, stdin_consumed=false, mode=Search(Standard))
DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1106: heuristic chose to search stdin
DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1254: found hostname for hyperlink configuration: FredArch
DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1264: hyperlink format: ""
DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:174: using 1 thread(s)
DEBUG|globset|crates/globset/src/lib.rs:448: glob converted to regex: Glob { glob: "**/.DS_*", re: "(?-u)^(?:/?|.*/)\\.DS_[^/]*$", opts: GlobOptions { case_insensitive: false, literal_separator: true, backslash_escape: true, empty_alternates: false }, tokens: Tokens([RecursivePrefix, Literal('.'), Literal('D'), Literal('S'), Literal('_'), ZeroOrMore]) }
DEBUG|globset|crates/globset/src/lib.rs:453: built glob set; 0 literals, 28 basenames, 2 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 1 regexes
DEBUG|grep_regex::config|crates/regex/src/config.rs:175: assembling HIR from 1 fixed string literals
DEBUG|globset|crates/globset/src/lib.rs:453: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

What is the expected behavior?

It prints foo, like 13.0.0 used to do. (I did not see relevant breaking changes in 14.0.0's release announcement.)


Ignoring some dependency patches, git bisect pointed me to 1035f6b.

Bisect script used:

#!/usr/bin/env bash

pushd "~/Programming/Rust/ripgrep" >/dev/null

function cleanup() {
    git reset --hard
    popd >/dev/null
}

trap 'cleanup' EXIT

cat Cargo.toml | rg -v /home/andrew | sponge Cargo.toml

if ! cargo build; then
    exit 125
fi

if ! printf '%s\0' foo bar hello world | cargo run -- --line-regexp --null-data bar >/dev/null; then
    exit 1
fi

exit 0
# bad: [cd5440fb6230f72ab598916c1c5ab96686541d47] changelog: fix wording
# good: [af6b6c543b224d348a8876f0c06245d9ea7929c5] 13.0.0
git bisect start 'master' '13.0.0'
# bad: [c9584b035b19244e370a50fd872a3ae2039e2931] ci/release: use GitHub CLI
git bisect bad c9584b035b19244e370a50fd872a3ae2039e2931
# good: [7f23cd63a51d45415bb3df28d053562844194767] ignore/types: add automated test for sortedness
git bisect good 7f23cd63a51d45415bb3df28d053562844194767
# good: [335aa4937aed8f9c1bf3f5722e8fc4d671d46dcb] ignore/types: add *.pyi for Python
git bisect good 335aa4937aed8f9c1bf3f5722e8fc4d671d46dcb
# bad: [a68db3ac02fbfa2154cb2c8029c39e89d24c2792] deps: drop temporary patch and move to bstr 1.6
git bisect bad a68db3ac02fbfa2154cb2c8029c39e89d24c2792
# bad: [51480d57a67c229d624e5d1edb4aa3782e883696] regex: simplify AST analysis a bit
git bisect bad 51480d57a67c229d624e5d1edb4aa3782e883696
# good: [a7f1276021df2217dead1481b2c2b38595ed8fb3] readme: update Debian instructions
git bisect good a7f1276021df2217dead1481b2c2b38595ed8fb3
# bad: [e028ea37928930c80e5c3172d1df306b85a86758] regex: migrate grep-regex to regex-automata
git bisect bad e028ea37928930c80e5c3172d1df306b85a86758
# bad: [1035f6b1ff502eb5b1a5fc49a79f45971c772d47] deps: initial migration steps to regex 1.9
git bisect bad 1035f6b1ff502eb5b1a5fc49a79f45971c772d47
# first bad commit: [1035f6b1ff502eb5b1a5fc49a79f45971c772d47] deps: initial migration steps to regex 1.9
@BurntSushi
Copy link
Owner

Oh this is a nasty one. It turns out that ripgrep 13 was only working accidentally here. For example, I can make it fail:

$ printf '%s\0' 'bar' 'hello' 'world' | rg-13.0.0 --null-data --line-regexp '\w{3}'
$

And I can even make ripgrep 14 succeed with a sufficiently wacky regex:

$ printf '%s\0' 'foo' 'bar' 'hello' 'world' | rg --null-data --line-regexp '[A-Z]*bar|[A-Y]*blahblah'
bar%

What is happening here are literal optimizations. When searching for foo with ripgrep 13, ripgrep builds a regex like (?m:^)foo(?m:$), and because the old regex engine couldn't handle literal optimizations involving line anchors, ripgrep plucked out foo as an inner literal and scanned for that first. Once it found a "line" containing foo, it strips the line terminators and then runs the full regex. Thus, it finds a match.

But in ripgrep 14, the regex engine is a lot more sophisticated and can handle the literal optimization on its own. So ripgrep backs off and let's it do its thing. But of course, in this context, the regex engine is running on potentially many lines at once, and so the (?m:^) and (?m:$) really need to be able to match. By default, they only match \n. And thus, no match is found because NUL is used as a line terminator.

Thankfully, in the rewrite of the regex engine, I added a new config option to set the line terminator used by (?m:^) and (?m:$). Indeed, I did it for exactly this use case. I just haven't gotten around to making ripgrep use that option yet. Its handling of line terminators is pretty twisted and needs to be unfucked.

I think as a stop-gap, I'll make ripgrep divert to the "slower" line-by-line search when --null-data is used. That will ensure correct operation at least.

@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Nov 28, 2023
BurntSushi added a commit that referenced this issue Nov 28, 2023
As the FIXME comment says, ripgrep is not yet using the new line
terminator option in regex-automata exposed for exactly this purpose.
Because of that, line anchors like `(?m:^)` and `(?m:$)` will only match
`\n` as a line terminator. This means that when --null-data is used in
combination with --line-regexp, the anchors inserted by --line-regexp
will not match correctly. This is only a big deal in the "fast" path,
which requires the regex engine to deal with line terminators itself
correctly. The slow path strips line terminators regardless of what they
are, and so the line anchors can match (begin/end of haystack).

Fixes #2658
BurntSushi added a commit that referenced this issue Nov 28, 2023
As the FIXME comment says, ripgrep is not yet using the new line
terminator option in regex-automata exposed for exactly this purpose.
Because of that, line anchors like `(?m:^)` and `(?m:$)` will only match
`\n` as a line terminator. This means that when --null-data is used in
combination with --line-regexp, the anchors inserted by --line-regexp
will not match correctly. This is only a big deal in the "fast" path,
which requires the regex engine to deal with line terminators itself
correctly. The slow path strips line terminators regardless of what they
are, and so the line anchors can match (begin/end of haystack).

Fixes #2658
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 28, 2023
14.0.2 (2023-11-27)
===================
This is a patch release with a few small bug fixes.

Bug fixes:

* [BUG #2654](BurntSushi/ripgrep#2654):
  Fix `deb` release sha256 sum file.
* [BUG #2658](BurntSushi/ripgrep#2658):
  Fix partial regression in the behavior of `--null-data --line-regexp`.
* [BUG #2659](BurntSushi/ripgrep#2659):
  Fix Fish shell completions.
* [BUG #2662](BurntSushi/ripgrep#2662):
  Fix typo in documentation for `-i/--ignore-case`.


14.0.1 (2023-11-26)
===================
This a patch release meant to fix `cargo install ripgrep` on Windows.

Bug fixes:

* [BUG #2653](BurntSushi/ripgrep#2653):
  Include `pkg/windows/Manifest.xml` in crate package.


14.0.0 (2023-11-26)
===================
ripgrep 14 is a new major version release of ripgrep that has some new
features, performance improvements and a lot of bug fixes.

The headlining feature in this release is hyperlink support. In this release,
they are an opt-in feature but may change to an opt-out feature in the future.
To enable them, try passing `--hyperlink-format default`. If you use [VS Code],
then try passing `--hyperlink-format vscode`. Please [report your experience
with hyperlinks][report-hyperlinks], positive or negative.

[VS Code]: https://code.visualstudio.com/
[report-hyperlinks]: BurntSushi/ripgrep#2611

Another headlining development in this release is that it contains a rewrite
of its regex engine. You generally shouldn't notice any changes, except for
some searches may get faster. You can read more about the [regex engine rewrite
on my blog][regex-internals]. Please [report your performance improvements or
regressions that you notice][report-perf].

[report-perf]: BurntSushi/ripgrep#2652

Finally, ripgrep switched the library it uses for argument parsing. Users
should not notice a difference in most cases (error messages have changed
somewhat), but flag overrides should generally be more consistent. For example,
things like `--no-ignore --ignore-vcs` work as one would expect (disables all
filtering related to ignore rules except for rules found in version control
systems such as `git`).

[regex-internals]: https://blog.burntsushi.net/regex-internals/

**BREAKING CHANGES**:

* `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to
  `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`.
  Instead, they only partially override `-C`.

Build process changes:

* ripgrep's shell completions and man page are now created by running ripgrep
with a new `--generate` flag. For example, `rg --generate man` will write a
man page in `roff` format on stdout. The release archives have not changed.
* The optional build dependency on `asciidoc` or `asciidoctor` has been
dropped. Previously, it was used to produce ripgrep's man page. ripgrep now
owns this process itself by writing `roff` directly.

Performance improvements:

* [PERF #1746](BurntSushi/ripgrep#1746):
  Make some cases with inner literals faster.
* [PERF #1760](BurntSushi/ripgrep#1760):
  Make most searches with `\b` look-arounds (among others) much faster.
* [PERF #2591](BurntSushi/ripgrep#2591):
  Parallel directory traversal now uses work stealing for faster searches.
* [PERF #2642](BurntSushi/ripgrep#2642):
  Parallel directory traversal has some contention reduced.

Feature enhancements:

* Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo,
  Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V
* [FEATURE #665](BurntSushi/ripgrep#665):
  Add a new `--hyperlink-format` flag that turns file paths into hyperlinks.
* [FEATURE #1709](BurntSushi/ripgrep#1709):
  Improve documentation of ripgrep's behavior when stdout is a tty.
* [FEATURE #1737](BurntSushi/ripgrep#1737):
  Provide binaries for Apple silicon.
* [FEATURE #1790](BurntSushi/ripgrep#1790):
  Add new `--stop-on-nonmatch` flag.
* [FEATURE #1814](BurntSushi/ripgrep#1814):
  Flags are now categorized in `-h/--help` output and ripgrep's man page.
* [FEATURE #1838](BurntSushi/ripgrep#1838):
  An error is shown when searching for NUL bytes with binary detection enabled.
* [FEATURE #2195](BurntSushi/ripgrep#2195):
  When `extra-verbose` mode is enabled in zsh, show extra file type info.
* [FEATURE #2298](BurntSushi/ripgrep#2298):
  Add instructions for installing ripgrep using `cargo binstall`.
* [FEATURE #2409](BurntSushi/ripgrep#2409):
  Added installation instructions for `winget`.
* [FEATURE #2425](BurntSushi/ripgrep#2425):
  Shell completions (and man page) can be created via `rg --generate`.
* [FEATURE #2524](BurntSushi/ripgrep#2524):
  The `--debug` flag now indicates whether stdin or `./` is being searched.
* [FEATURE #2643](BurntSushi/ripgrep#2643):
  Make `-d` a short flag for `--max-depth`.
* [FEATURE #2645](BurntSushi/ripgrep#2645):
  The `--version` output will now also contain PCRE2 availability information.

Bug fixes:

* [BUG #884](BurntSushi/ripgrep#884):
  Don't error when `-v/--invert-match` is used multiple times.
* [BUG #1275](BurntSushi/ripgrep#1275):
  Fix bug with `\b` assertion in the regex engine.
* [BUG #1376](BurntSushi/ripgrep#1376):
  Using `--no-ignore --ignore-vcs` now works as one would expect.
* [BUG #1622](BurntSushi/ripgrep#1622):
  Add note about error messages to `-z/--search-zip` documentation.
* [BUG #1648](BurntSushi/ripgrep#1648):
  Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail.
* [BUG #1701](BurntSushi/ripgrep#1701):
  Fix bug where some flags could not be repeated.
* [BUG #1757](BurntSushi/ripgrep#1757):
  Fix bug when searching a sub-directory didn't have ignores applied correctly.
* [BUG #1891](BurntSushi/ripgrep#1891):
  Fix bug when using `-w` with a regex that can match the empty string.
* [BUG #1911](BurntSushi/ripgrep#1911):
  Disable mmap searching in all non-64-bit environments.
* [BUG #1966](BurntSushi/ripgrep#1966):
  Fix bug where ripgrep can panic when printing to stderr.
* [BUG #2046](BurntSushi/ripgrep#2046):
  Clarify that `--pre` can accept any kind of path in the documentation.
* [BUG #2108](BurntSushi/ripgrep#2108):
  Improve docs for `-r/--replace` syntax.
* [BUG #2198](BurntSushi/ripgrep#2198):
  Fix bug where `--no-ignore-dot` would not ignore `.rgignore`.
* [BUG #2201](BurntSushi/ripgrep#2201):
  Improve docs for `-r/--replace` flag.
* [BUG #2288](BurntSushi/ripgrep#2288):
  `-A` and `-B` now only each partially override `-C`.
* [BUG #2236](BurntSushi/ripgrep#2236):
  Fix gitignore parsing bug where a trailing `\/` resulted in an error.
* [BUG #2243](BurntSushi/ripgrep#2243):
  Fix `--sort` flag for values other than `path`.
* [BUG #2246](BurntSushi/ripgrep#2246):
  Add note in `--debug` logs when binary files are ignored.
* [BUG #2337](BurntSushi/ripgrep#2337):
  Improve docs to mention that `--stats` is always implied by `--json`.
* [BUG #2381](BurntSushi/ripgrep#2381):
  Make `-p/--pretty` override flags like `--no-line-number`.
* [BUG #2392](BurntSushi/ripgrep#2392):
  Improve global git config parsing of the `excludesFile` field.
* [BUG #2418](BurntSushi/ripgrep#2418):
  Clarify sorting semantics of `--sort=path`.
* [BUG #2458](BurntSushi/ripgrep#2458):
  Make `--trim` run before `-M/--max-columns` takes effect.
* [BUG #2479](BurntSushi/ripgrep#2479):
  Add documentation about `.ignore`/`.rgignore` files in parent directories.
* [BUG #2480](BurntSushi/ripgrep#2480):
  Fix bug when using inline regex flags with `-e/--regexp`.
* [BUG #2505](BurntSushi/ripgrep#2505):
  Improve docs for `--vimgrep` by mentioning footguns and some work-arounds.
* [BUG #2519](BurntSushi/ripgrep#2519):
  Fix incorrect default value in documentation for `--field-match-separator`.
* [BUG #2523](BurntSushi/ripgrep#2523):
  Make executable searching take `.com` into account on Windows.
* [BUG #2574](BurntSushi/ripgrep#2574):
  Fix bug in `-w/--word-regexp` that would result in incorrect match offsets.
* [BUG #2623](BurntSushi/ripgrep#2623):
  Fix a number of bugs with the `-w/--word-regexp` flag.
* [BUG #2636](BurntSushi/ripgrep#2636):
  Strip release binaries for macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

2 participants