Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redirecting ripgrep to a file within the search directory will cause rg to also search that file #286

Closed
Ophirr33 opened this issue Dec 22, 2016 · 5 comments · Fixed by #310
Labels
bug A bug.

Comments

@Ophirr33
Copy link
Sponsor

Ophirr33 commented Dec 22, 2016

This leads to ripgrep searching its own output, taking a lot of time and eating up disk space. Here's a minimal example:

mkdir rg-test && cd rg-test
echo "hello" > found.txt
rg -e "hello" > tmp.txt
rg -e "hello" > tmp2.txt"
rg -e "hello" > tmp3.txt"

Then, running rg -e "hello" will output the following:

found.txt
1:hello

tmp.txt
1:found.txt:hello
2:tmp.txt:found.txt:hello

tmp2.txt
1:found.txt:hello
2:tmp.txt:found.txt:hello
3:tmp.txt:tmp.txt:found.txt:hello
4:tmp2.txt:found.txt:hello
5:tmp2.txt:tmp.txt:found.txt:hello
6:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello 

tmp3.txt
1:found.txt:hello
2:tmp.txt:found.txt:hello
3:tmp.txt:tmp.txt:found.txt:hello
4:tmp2.txt:found.txt:hello
5:tmp2.txt:tmp.txt:found.txt:hello
6:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello
7:tmp2.txt:tmp2.txt:found.txt:hello
8:tmp2.txt:tmp2.txt:tmp.txt:found.txt:hello
9:tmp2.txt:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello
10:tmp3.txt:found.txt:hello
11:tmp3.txt:tmp.txt:found.txt:hello
12:tmp3.txt:tmp.txt:tmp.txt:found.txt:hello
13:tmp3.txt:tmp2.txt:found.txt:hello
14:tmp3.txt:tmp2.txt:tmp.txt:found.txt:hello
15:tmp3.txt:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello
16:tmp3.txt:tmp2.txt:tmp2.txt:found.txt:hello
17:tmp3.txt:tmp2.txt:tmp2.txt:tmp.txt:found.txt:hello
18:tmp3.txt:tmp2.txt:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello

When doing the same with grep instead, we get the expected output (and switching hello with yellow)

found2.txt
1:yellow

temp.txt
1:./found2.txt:yellow

temp2.txt
1:./found2.txt:yellow
2:./temp.txt:./found2.txt:yellow

temp3.txt
1:./found2.txt:yellow
2:./temp.txt:./found2.txt:yellow
3:./temp2.txt:./found2.txt:yellow
4:./temp2.txt:./temp.txt:./found2.txt:yellow

The more searches that ripgrep reports, the bigger the file blow up is (I did this on a ripgrep search that had 17,000 hits...). Currently, this can be worked around just by piping to a file outside the directory being searched, or piping into a pager instead.

@BurntSushi
Copy link
Owner

I think we can probably fix this by getting the file descriptor of where stdout is being redirected and then make sure we don't search any file with that same descriptor.

@BurntSushi BurntSushi added the bug A bug. label Dec 23, 2016
@retep998
Copy link

retep998 commented Jan 2, 2017

On Windows you'd solve this by getting the ID of the file. Call GetFileInformationByHandle and examine the dwVolumeSerialNumber nFileIndexHigh and nFileIndexLow fields.

@BurntSushi
Copy link
Owner

@retep998 There's some subtle additional details. From MSDN:

The identifier that is stored in the nFileIndexHigh and nFileIndexLow members is called the file ID. Support for file IDs is file system-specific. File IDs are not guaranteed to be unique over time, because file systems are free to reuse them. In some cases, the file ID for a file can change over time.

AFAIK, the only way to get a guarantee that the file index numbers aren't reused is if you keep the file handle open. Which I guess is fine in this case, since we only need to keep one handle (stdout) open for the lifetime of the process.

@BurntSushi
Copy link
Owner

I guess this applies similarly on Linux too. You can't rely on just the inode number, since your directory might span multiple mount points.

@BurntSushi
Copy link
Owner

BurntSushi commented Jan 8, 2017

There's enough subtlety here that I'm going to push this logic into a new independent crate, and then use that inside of both ripgrep and walkdir. walkdir does expose is_same_file, but the API operates on paths, and we'd ideally not need to stat the stdout file every single time.

BurntSushi added a commit that referenced this issue Jan 9, 2017
When running ripgrep like this:

    rg foo > output

we must be careful not to search `output` since ripgrep is actively writing
to it. Searching it can cause massive blowups where the file grows without
bound.

While this is conceptually easy to fix (check the inode of the redirection
and the inode of the file you're about to search), there are a few problems
with it.

First, inodes are a Unix thing, so we need a Windows specific solution to
this as well. To resolve this concern, I created a new crate, `same-file`,
which provides a cross platform abstraction.

Second, stat'ing every file is costly. This is not avoidable on Windows,
but on Unix, we can get the inode number directly from directory traversal.
However, this information wasn't exposed, but now it is (through both the
ignore and walkdir crates).

Fixes #286
BurntSushi added a commit that referenced this issue Jan 9, 2017
When running ripgrep like this:

    rg foo > output

we must be careful not to search `output` since ripgrep is actively writing
to it. Searching it can cause massive blowups where the file grows without
bound.

While this is conceptually easy to fix (check the inode of the redirection
and the inode of the file you're about to search), there are a few problems
with it.

First, inodes are a Unix thing, so we need a Windows specific solution to
this as well. To resolve this concern, I created a new crate, `same-file`,
which provides a cross platform abstraction.

Second, stat'ing every file is costly. This is not avoidable on Windows,
but on Unix, we can get the inode number directly from directory traversal.
However, this information wasn't exposed, but now it is (through both the
ignore and walkdir crates).

Fixes #286
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants