Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behavior of --sort modified, giving incorrect order #2243

Closed
qx220 opened this issue Jun 24, 2022 · 7 comments
Closed

Strange behavior of --sort modified, giving incorrect order #2243

qx220 opened this issue Jun 24, 2022 · 7 comments
Labels
bug A bug. help wanted Others are encouraged to work on this issue.

Comments

@qx220
Copy link

qx220 commented Jun 24, 2022

What version of ripgrep are you using?

ripgrep 11.0.2
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

apt-get

What operating system are you using ripgrep on?

5.15.0-33-generic #34~20.04.1-Ubuntu SMP Thu May 19 15:51:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Describe your bug.

When using the vim plugin LeaderF, which uses rg as its search engine, I realized that the sorting order seems to be "randomly" incorrect.

After a bit of tweaking, I produced a minimal script reproducing the issue.

What are the steps to reproduce the behavior?

Under a clean temporary folder:

echo string > modified_first; sleep 1; mkdir -p dir; echo string > dir/modified_second; find -type f | xargs stat

At this point, rg returns the correct order when executing:

rg --sort modified string

However, if we modify the two files now, the order becomes incorrect:

echo string > modified_first; sleep 1; mkdir -p dir; echo string > dir/modified_second; find -type f | xargs stat # as the output shows, "./modified_first" is modified before "./dir/modified_second"

Now if we run rg --sort modified string, we find the following output:

dir/modified_second
1:string

modified_first
1:string

when the expected behavior should be:

modified_first
1:string

dir/modified_second
1:string

I'm attaching the full transcript with output below:

l14:~/tmp/ripgrep$ echo string > modified_first; sleep 1; mkdir -p dir; echo string > dir/modified_second; find -type f | xargs stat
  File: ./modified_first
  Size: 7         	Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 9699609     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    user)   Gid: ( 1000/    user)
Access: 2022-06-23 22:29:05.404139561 -0400
Modify: 2022-06-23 22:29:05.404139561 -0400
Change: 2022-06-23 22:29:05.404139561 -0400
 Birth: -
  File: ./dir/modified_second
  Size: 7         	Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 9699610     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    user)   Gid: ( 1000/    user)
Access: 2022-06-23 22:29:06.412148288 -0400
Modify: 2022-06-23 22:29:06.412148288 -0400
Change: 2022-06-23 22:29:06.412148288 -0400
 Birth: -
l14:~/tmp/ripgrep$ rg --sort modified string
modified_first
1:string

dir/modified_second
1:string
l14:~/tmp/ripgrep$ echo string > modified_first; sleep 1; mkdir -p dir; echo string > dir/modified_second; find -type f | xargs stat
  File: ./modified_first
  Size: 7         	Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 9699609     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    user)   Gid: ( 1000/    user)
Access: 2022-06-23 22:29:12.280199102 -0400
Modify: 2022-06-23 22:29:20.040266325 -0400
Change: 2022-06-23 22:29:20.040266325 -0400
 Birth: -
  File: ./dir/modified_second
  Size: 7         	Blocks: 8          IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 9699610     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/    user)   Gid: ( 1000/    user)
Access: 2022-06-23 22:29:12.280199102 -0400
Modify: 2022-06-23 22:29:21.044275025 -0400
Change: 2022-06-23 22:29:21.044275025 -0400
 Birth: -
l14:~/tmp/ripgrep$ rg --sort modified string
dir/modified_second
1:string

modified_first
1:string
l14:~/tmp/ripgrep$ 

Potential explanation

I don't quite understand why this issue pops up, but I realized that this error does not exist if modified_second is not under dir directory.

So I did a touch dir, which seems to give the correct order. My speculation is that, --sort modified somehow takes into consideration the modtime of a file's parent directory instead of the file itself

@BurntSushi
Copy link
Owner

Nice bug report, thank you.

Sadly, yeah, this is a bug and it's rooted in how I implemented sorting. Namely, sorting is not fully correct. The sorting happens during directory traversal. So when you read the first set of directory entries (before recursing), those entries get sorted. Then recursion happens in the first directory seen, and those directory entries are themselves sorted.

In other words, the sorting implementation relies on the property that sorting the parent directory preserves the sorting order of its children. That works with --sort path, but probably not for anything else. For other things, probably sorting needs to happen globally. Which is pretty bad. It means ripgrep needs to collect everything into memory first and then sort the results. The idea with the current strategy is that it avoids needing to read everything into memory. But it seems quite foolish in retrospect.

So... definitely a bug, but sadly, probably not one that I'll have time to fix soon personally. If someone else wants to submit a patch though, that would be fine. However, I'd encourage discussing the implementation path with me first. It will likely require collecting all file paths first, sorting those, and then running the search. Sounds easy, but will likely require a fair bit of refactoring unfortunately.

@BurntSushi BurntSushi added bug A bug. help wanted Others are encouraged to work on this issue. labels Jun 24, 2022
@BurntSushi
Copy link
Owner

Originally authored by @nguyenvukhang and posted in #2365:

As outlined in #2243, results are incorrectly sorted by file stats (modified/created/accessed time).

One approach is to make the recursive directory traversal smarter. For example, when sorting by latest modified, the parent directory will be assigned the latest modified time of all its children, rather than the direct result of calling stat on it.

Another approach will be to sort the results after all search results have been matched and collected. This will likely collecting all results into memory, tagging each result with a key based on the sorting criteria, and sorting the results by this key.


(I moved the comment here to keep things in one place.)

@PoignardAzur
Copy link

@BurntSushi Is this still something you want help with?

I'm looking to make a contribution to an unfamiliar project so I can test Pernosco, and this issue looks like a good match.

However, I'd encourage discussing the implementation path with me first. It will likely require collecting all file paths first, sorting those, and then running the search.

Sounds a bit memory-intensive, but that probably can't be helped in the general case.

Sounds easy, but will likely require a fair bit of refactoring unfortunately.

Anything specific in mind?

My guess without knowing anything about the codebase is that you would need whatever trait handles file sorting to have a collect_everything_ahead_of_time() -> bool method to contrast with cases where the files can be sorted on the fly.

@Icelk
Copy link

Icelk commented Jul 5, 2023

One approach is to make the recursive directory traversal smarter. For example, when sorting by latest modified, the parent directory will be assigned the latest modified time of all its children, rather than the direct result of calling stat on it.

This is invalid. Either, the directory is walked by first yielding the directory inodes. Then, we don't know their contents, and therefore not the mtime. The other way, where the contents are searched first, also doesn't help us, since the results are printed before we have the results for sorting order. And on top of that, ripgrep relies on the directories-first approach for filtering.

One approach is to walk the directory and get all the mtimes, then walk it with correct sorting. This won't solve our problem either, since we want to sort files in different directories WITH EACH OTHER, and so collecting them in a big list is the only way :(

@Icelk
Copy link

Icelk commented Jul 5, 2023

The diff to get something hacky working (for single-threaded files only, so specity --sort modified). It collects all files. The sort method is hardcoded.

diff --git a/crates/core/main.rs b/crates/core/main.rs
index 2584ea9..38032d2 100644
--- a/crates/core/main.rs
+++ b/crates/core/main.rs
@@ -218,11 +218,19 @@ fn files(args: &Args) -> Result<bool> {
     let subject_builder = args.subject_builder();
     let mut matched = false;
     let mut path_printer = args.path_printer(args.stdout())?;
-    for result in args.walker()? {
-        let subject = match subject_builder.build_from_result(result) {
-            Some(subject) => subject,
-            None => continue,
-        };
+    let mut subjects: Vec<_> = args
+        .walker()?
+        .filter_map(|result| match subject_builder.build_from_result(result) {
+            Some(subject) => Some((subject.metadata().ok()?, subject)),
+            None => None,
+        })
+        .collect();
+
+    subjects.sort_unstable_by(|a, b| {
+        a.0.modified().unwrap().cmp(&b.0.modified().unwrap()).reverse()
+    });
+
+    for (_, subject) in subjects {
         matched = true;
         if quit_after_match {
             break;
diff --git a/crates/core/subject.rs b/crates/core/subject.rs
index 06511c5..8fd505c 100644
--- a/crates/core/subject.rs
+++ b/crates/core/subject.rs
@@ -1,6 +1,7 @@
+use std::fs::Metadata;
 use std::path::Path;
 
-use ignore::{self, DirEntry};
+use ignore::{self, DirEntry, Error};
 use log;
 
 /// A configuration for describing how subjects should be built.
@@ -112,6 +113,9 @@ impl Subject {
             self.dent.path()
         }
     }
+    pub fn metadata(&self) -> Result<Metadata, Error> {
+        self.dent.metadata()
+    }
 
     /// Returns true if and only if this entry corresponds to stdin.
     pub fn is_stdin(&self) -> bool {

@Icelk
Copy link

Icelk commented Jul 5, 2023

I observe only a ~20% performance hit (with a ~2s runtime, no colours, piped to /dev/null) as compared to the current sort algo. That seems worth it, since now it actually gives correct results. If you'd like, I can look into making a PR & implementing this for the parallel case.

BurntSushi pushed a commit that referenced this issue Jul 9, 2023
Previously, sorting worked by sorting the parents and then sorting the
children within each parent. This was done during traversal, but it only
works when sorting parents preserves the overall order. This generally
only works for '--sort path' in ascending order.

This commit fixes the rest of the sorting behavior by collecting all of
the paths to search and then sorting them before searching. We only
collect all of the paths when sorting was requested.

Fixes #2243, Closes #2361
BurntSushi pushed a commit that referenced this issue Jul 9, 2023
Previously, sorting worked by sorting the parents and then sorting the
children within each parent. This was done during traversal, but it only
works when sorting parents preserves the overall order. This generally
only works for '--sort path' in ascending order.

This commit fixes the rest of the sorting behavior by collecting all of
the paths to search and then sorting them before searching. We only
collect all of the paths when sorting was requested.

Fixes #2243, Closes #2361
BurntSushi pushed a commit that referenced this issue Jul 9, 2023
Previously, sorting worked by sorting the parents and then sorting the
children within each parent. This was done during traversal, but it only
works when sorting parents preserves the overall order. This generally
only works for '--sort path' in ascending order.

This commit fixes the rest of the sorting behavior by collecting all of
the paths to search and then sorting them before searching. We only
collect all of the paths when sorting was requested.

Fixes #2243, Closes #2361
BurntSushi pushed a commit that referenced this issue Jul 9, 2023
Previously, sorting worked by sorting the parents and then sorting the
children within each parent. This was done during traversal, but it only
works when sorting parents preserves the overall order. This generally
only works for '--sort path' in ascending order.

This commit fixes the rest of the sorting behavior by collecting all of
the paths to search and then sorting them before searching. We only
collect all of the paths when sorting was requested.

Fixes #2243, Closes #2361
@Icelk
Copy link

Icelk commented Jul 9, 2023

Thanks!

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 28, 2023
14.0.2 (2023-11-27)
===================
This is a patch release with a few small bug fixes.

Bug fixes:

* [BUG #2654](BurntSushi/ripgrep#2654):
  Fix `deb` release sha256 sum file.
* [BUG #2658](BurntSushi/ripgrep#2658):
  Fix partial regression in the behavior of `--null-data --line-regexp`.
* [BUG #2659](BurntSushi/ripgrep#2659):
  Fix Fish shell completions.
* [BUG #2662](BurntSushi/ripgrep#2662):
  Fix typo in documentation for `-i/--ignore-case`.


14.0.1 (2023-11-26)
===================
This a patch release meant to fix `cargo install ripgrep` on Windows.

Bug fixes:

* [BUG #2653](BurntSushi/ripgrep#2653):
  Include `pkg/windows/Manifest.xml` in crate package.


14.0.0 (2023-11-26)
===================
ripgrep 14 is a new major version release of ripgrep that has some new
features, performance improvements and a lot of bug fixes.

The headlining feature in this release is hyperlink support. In this release,
they are an opt-in feature but may change to an opt-out feature in the future.
To enable them, try passing `--hyperlink-format default`. If you use [VS Code],
then try passing `--hyperlink-format vscode`. Please [report your experience
with hyperlinks][report-hyperlinks], positive or negative.

[VS Code]: https://code.visualstudio.com/
[report-hyperlinks]: BurntSushi/ripgrep#2611

Another headlining development in this release is that it contains a rewrite
of its regex engine. You generally shouldn't notice any changes, except for
some searches may get faster. You can read more about the [regex engine rewrite
on my blog][regex-internals]. Please [report your performance improvements or
regressions that you notice][report-perf].

[report-perf]: BurntSushi/ripgrep#2652

Finally, ripgrep switched the library it uses for argument parsing. Users
should not notice a difference in most cases (error messages have changed
somewhat), but flag overrides should generally be more consistent. For example,
things like `--no-ignore --ignore-vcs` work as one would expect (disables all
filtering related to ignore rules except for rules found in version control
systems such as `git`).

[regex-internals]: https://blog.burntsushi.net/regex-internals/

**BREAKING CHANGES**:

* `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to
  `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`.
  Instead, they only partially override `-C`.

Build process changes:

* ripgrep's shell completions and man page are now created by running ripgrep
with a new `--generate` flag. For example, `rg --generate man` will write a
man page in `roff` format on stdout. The release archives have not changed.
* The optional build dependency on `asciidoc` or `asciidoctor` has been
dropped. Previously, it was used to produce ripgrep's man page. ripgrep now
owns this process itself by writing `roff` directly.

Performance improvements:

* [PERF #1746](BurntSushi/ripgrep#1746):
  Make some cases with inner literals faster.
* [PERF #1760](BurntSushi/ripgrep#1760):
  Make most searches with `\b` look-arounds (among others) much faster.
* [PERF #2591](BurntSushi/ripgrep#2591):
  Parallel directory traversal now uses work stealing for faster searches.
* [PERF #2642](BurntSushi/ripgrep#2642):
  Parallel directory traversal has some contention reduced.

Feature enhancements:

* Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo,
  Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V
* [FEATURE #665](BurntSushi/ripgrep#665):
  Add a new `--hyperlink-format` flag that turns file paths into hyperlinks.
* [FEATURE #1709](BurntSushi/ripgrep#1709):
  Improve documentation of ripgrep's behavior when stdout is a tty.
* [FEATURE #1737](BurntSushi/ripgrep#1737):
  Provide binaries for Apple silicon.
* [FEATURE #1790](BurntSushi/ripgrep#1790):
  Add new `--stop-on-nonmatch` flag.
* [FEATURE #1814](BurntSushi/ripgrep#1814):
  Flags are now categorized in `-h/--help` output and ripgrep's man page.
* [FEATURE #1838](BurntSushi/ripgrep#1838):
  An error is shown when searching for NUL bytes with binary detection enabled.
* [FEATURE #2195](BurntSushi/ripgrep#2195):
  When `extra-verbose` mode is enabled in zsh, show extra file type info.
* [FEATURE #2298](BurntSushi/ripgrep#2298):
  Add instructions for installing ripgrep using `cargo binstall`.
* [FEATURE #2409](BurntSushi/ripgrep#2409):
  Added installation instructions for `winget`.
* [FEATURE #2425](BurntSushi/ripgrep#2425):
  Shell completions (and man page) can be created via `rg --generate`.
* [FEATURE #2524](BurntSushi/ripgrep#2524):
  The `--debug` flag now indicates whether stdin or `./` is being searched.
* [FEATURE #2643](BurntSushi/ripgrep#2643):
  Make `-d` a short flag for `--max-depth`.
* [FEATURE #2645](BurntSushi/ripgrep#2645):
  The `--version` output will now also contain PCRE2 availability information.

Bug fixes:

* [BUG #884](BurntSushi/ripgrep#884):
  Don't error when `-v/--invert-match` is used multiple times.
* [BUG #1275](BurntSushi/ripgrep#1275):
  Fix bug with `\b` assertion in the regex engine.
* [BUG #1376](BurntSushi/ripgrep#1376):
  Using `--no-ignore --ignore-vcs` now works as one would expect.
* [BUG #1622](BurntSushi/ripgrep#1622):
  Add note about error messages to `-z/--search-zip` documentation.
* [BUG #1648](BurntSushi/ripgrep#1648):
  Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail.
* [BUG #1701](BurntSushi/ripgrep#1701):
  Fix bug where some flags could not be repeated.
* [BUG #1757](BurntSushi/ripgrep#1757):
  Fix bug when searching a sub-directory didn't have ignores applied correctly.
* [BUG #1891](BurntSushi/ripgrep#1891):
  Fix bug when using `-w` with a regex that can match the empty string.
* [BUG #1911](BurntSushi/ripgrep#1911):
  Disable mmap searching in all non-64-bit environments.
* [BUG #1966](BurntSushi/ripgrep#1966):
  Fix bug where ripgrep can panic when printing to stderr.
* [BUG #2046](BurntSushi/ripgrep#2046):
  Clarify that `--pre` can accept any kind of path in the documentation.
* [BUG #2108](BurntSushi/ripgrep#2108):
  Improve docs for `-r/--replace` syntax.
* [BUG #2198](BurntSushi/ripgrep#2198):
  Fix bug where `--no-ignore-dot` would not ignore `.rgignore`.
* [BUG #2201](BurntSushi/ripgrep#2201):
  Improve docs for `-r/--replace` flag.
* [BUG #2288](BurntSushi/ripgrep#2288):
  `-A` and `-B` now only each partially override `-C`.
* [BUG #2236](BurntSushi/ripgrep#2236):
  Fix gitignore parsing bug where a trailing `\/` resulted in an error.
* [BUG #2243](BurntSushi/ripgrep#2243):
  Fix `--sort` flag for values other than `path`.
* [BUG #2246](BurntSushi/ripgrep#2246):
  Add note in `--debug` logs when binary files are ignored.
* [BUG #2337](BurntSushi/ripgrep#2337):
  Improve docs to mention that `--stats` is always implied by `--json`.
* [BUG #2381](BurntSushi/ripgrep#2381):
  Make `-p/--pretty` override flags like `--no-line-number`.
* [BUG #2392](BurntSushi/ripgrep#2392):
  Improve global git config parsing of the `excludesFile` field.
* [BUG #2418](BurntSushi/ripgrep#2418):
  Clarify sorting semantics of `--sort=path`.
* [BUG #2458](BurntSushi/ripgrep#2458):
  Make `--trim` run before `-M/--max-columns` takes effect.
* [BUG #2479](BurntSushi/ripgrep#2479):
  Add documentation about `.ignore`/`.rgignore` files in parent directories.
* [BUG #2480](BurntSushi/ripgrep#2480):
  Fix bug when using inline regex flags with `-e/--regexp`.
* [BUG #2505](BurntSushi/ripgrep#2505):
  Improve docs for `--vimgrep` by mentioning footguns and some work-arounds.
* [BUG #2519](BurntSushi/ripgrep#2519):
  Fix incorrect default value in documentation for `--field-match-separator`.
* [BUG #2523](BurntSushi/ripgrep#2523):
  Make executable searching take `.com` into account on Windows.
* [BUG #2574](BurntSushi/ripgrep#2574):
  Fix bug in `-w/--word-regexp` that would result in incorrect match offsets.
* [BUG #2623](BurntSushi/ripgrep#2623):
  Fix a number of bugs with the `-w/--word-regexp` flag.
* [BUG #2636](BurntSushi/ripgrep#2636):
  Strip release binaries for macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. help wanted Others are encouraged to work on this issue.
Projects
None yet
Development

No branches or pull requests

4 participants