Redirecting ripgrep to a file within the search directory will cause rg to also search that file #286

Ophirr33 · 2016-12-22T19:15:21Z

This leads to ripgrep searching its own output, taking a lot of time and eating up disk space. Here's a minimal example:

mkdir rg-test && cd rg-test
echo "hello" > found.txt
rg -e "hello" > tmp.txt
rg -e "hello" > tmp2.txt"
rg -e "hello" > tmp3.txt"

Then, running rg -e "hello" will output the following:

found.txt
1:hello

tmp.txt
1:found.txt:hello
2:tmp.txt:found.txt:hello

tmp2.txt
1:found.txt:hello
2:tmp.txt:found.txt:hello
3:tmp.txt:tmp.txt:found.txt:hello
4:tmp2.txt:found.txt:hello
5:tmp2.txt:tmp.txt:found.txt:hello
6:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello 

tmp3.txt
1:found.txt:hello
2:tmp.txt:found.txt:hello
3:tmp.txt:tmp.txt:found.txt:hello
4:tmp2.txt:found.txt:hello
5:tmp2.txt:tmp.txt:found.txt:hello
6:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello
7:tmp2.txt:tmp2.txt:found.txt:hello
8:tmp2.txt:tmp2.txt:tmp.txt:found.txt:hello
9:tmp2.txt:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello
10:tmp3.txt:found.txt:hello
11:tmp3.txt:tmp.txt:found.txt:hello
12:tmp3.txt:tmp.txt:tmp.txt:found.txt:hello
13:tmp3.txt:tmp2.txt:found.txt:hello
14:tmp3.txt:tmp2.txt:tmp.txt:found.txt:hello
15:tmp3.txt:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello
16:tmp3.txt:tmp2.txt:tmp2.txt:found.txt:hello
17:tmp3.txt:tmp2.txt:tmp2.txt:tmp.txt:found.txt:hello
18:tmp3.txt:tmp2.txt:tmp2.txt:tmp.txt:tmp.txt:found.txt:hello

When doing the same with grep instead, we get the expected output (and switching hello with yellow)

found2.txt
1:yellow

temp.txt
1:./found2.txt:yellow

temp2.txt
1:./found2.txt:yellow
2:./temp.txt:./found2.txt:yellow

temp3.txt
1:./found2.txt:yellow
2:./temp.txt:./found2.txt:yellow
3:./temp2.txt:./found2.txt:yellow
4:./temp2.txt:./temp.txt:./found2.txt:yellow

The more searches that ripgrep reports, the bigger the file blow up is (I did this on a ripgrep search that had 17,000 hits...). Currently, this can be worked around just by piping to a file outside the directory being searched, or piping into a pager instead.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-12-23T11:46:16Z

I think we can probably fix this by getting the file descriptor of where stdout is being redirected and then make sure we don't search any file with that same descriptor.

retep998 · 2017-01-02T17:59:03Z

On Windows you'd solve this by getting the ID of the file. Call GetFileInformationByHandle and examine the dwVolumeSerialNumber nFileIndexHigh and nFileIndexLow fields.

BurntSushi · 2017-01-08T17:26:53Z

@retep998 There's some subtle additional details. From MSDN:

The identifier that is stored in the nFileIndexHigh and nFileIndexLow members is called the file ID. Support for file IDs is file system-specific. File IDs are not guaranteed to be unique over time, because file systems are free to reuse them. In some cases, the file ID for a file can change over time.

AFAIK, the only way to get a guarantee that the file index numbers aren't reused is if you keep the file handle open. Which I guess is fine in this case, since we only need to keep one handle (stdout) open for the lifetime of the process.

BurntSushi · 2017-01-08T17:27:37Z

I guess this applies similarly on Linux too. You can't rely on just the inode number, since your directory might span multiple mount points.

BurntSushi · 2017-01-08T18:26:05Z

There's enough subtlety here that I'm going to push this logic into a new independent crate, and then use that inside of both ripgrep and walkdir. walkdir does expose is_same_file, but the API operates on paths, and we'd ideally not need to stat the stdout file every single time.

When running ripgrep like this: rg foo > output we must be careful not to search `output` since ripgrep is actively writing to it. Searching it can cause massive blowups where the file grows without bound. While this is conceptually easy to fix (check the inode of the redirection and the inode of the file you're about to search), there are a few problems with it. First, inodes are a Unix thing, so we need a Windows specific solution to this as well. To resolve this concern, I created a new crate, `same-file`, which provides a cross platform abstraction. Second, stat'ing every file is costly. This is not avoidable on Windows, but on Unix, we can get the inode number directly from directory traversal. However, this information wasn't exposed, but now it is (through both the ignore and walkdir crates). Fixes #286

BurntSushi added the bug A bug. label Dec 23, 2016

BurntSushi mentioned this issue Jan 9, 2017

Don't search stdout redirected file. #310

Merged

BurntSushi closed this as completed in #310 Jan 9, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redirecting ripgrep to a file within the search directory will cause rg to also search that file #286

Redirecting ripgrep to a file within the search directory will cause rg to also search that file #286

Ophirr33 commented Dec 22, 2016 •

edited

Loading

BurntSushi commented Dec 23, 2016

retep998 commented Jan 2, 2017

BurntSushi commented Jan 8, 2017

BurntSushi commented Jan 8, 2017

BurntSushi commented Jan 8, 2017 •

edited

Loading

Redirecting ripgrep to a file within the search directory will cause rg to also search that file #286

Redirecting ripgrep to a file within the search directory will cause rg to also search that file #286

Comments

Ophirr33 commented Dec 22, 2016 • edited Loading

BurntSushi commented Dec 23, 2016

retep998 commented Jan 2, 2017

BurntSushi commented Jan 8, 2017

BurntSushi commented Jan 8, 2017

BurntSushi commented Jan 8, 2017 • edited Loading

Ophirr33 commented Dec 22, 2016 •

edited

Loading

BurntSushi commented Jan 8, 2017 •

edited

Loading