[WIP] Add support for compressed file formats (closes #539) #751

balajisivaraman · 2018-01-14T12:02:19Z

Creating this PR for review purposes. I still haven't added any tests for the newer code, which is one of the main things still pending here.

Notes:

Had to add globset as dependency for ripgrep module to easily check for supported file formats. The in-built path.extension() method did not allow me to verify for *.tar.gz files. 😞
We could've used output from Command instead of spawn, which pipes stdout and stderr by default and gives us access to the process' exit status. However, it would've meant loading the entire file into memory in the form of a Vec<u8>. I've therefore gone with the spawn approach here, coupled with peeking at the stderr to verify whether the command succeeded or failed. It works nicely enough in practice in my testing.
Also, I noticed that the spawn method ends up returning an Err immediately if the command is not found. All other errors are obtained from the process' stderr. That's why I've handled it such that if spawn returns an Error, then rg will output the appropriate debug message.
I couldn't think of any other CLI flag that conflicts with -z/--search-zip. One thing to think of is how should rg handle when the -a argument is provided? Right now, if -a and -z are both provided, rg will search zip files as per the new code, but will also end up treating it as a binary file and try searching it. Are we OK with this?
Not really fond of passing the no-messages flag to the decompressor, but there's no better option if we want all error handling to remain within the module itself. In this way, the worker module remains largely unaffected if we choose to change how we do decompression at a later point in time.

Any other thoughts and suggestions are welcome.

okdana · 2018-01-14T12:39:24Z

complete/_rg

@@ -87,6 +87,7 @@ _rg() {
    '(-w -x --line-regexp --word-regexp)'{-w,--word-regexp}'[only show matches surrounded by word boundaries]'
    '(-e -f --file --files --regexp --type-list)1: :_rg_pattern'
    '(--type-list)*:file:_files'
+    '*'{-z,--search-zip}'search through zip files'


The exclusions and the syntax itself are wrong here. At a minimum it should be something like this:

'(-z --search-zip)'{-z,--search-zip}'[search through zip files]'

We might also want to add other exclusions, but i'd like to do another pass on this file within the next few weeks, so it can wait until then to get fancier.

I would also suggest changing the description wording to something like search contents of compressed files — 'through' is a bit ambiguous, and we don't actually search zip files at all.

I'll fix the syntax as per your suggestion. This file is always throwing me off. Sorry about that.

I thought of using -C/--search-compressed instead of -z/--search-zip. The only reason I went with the latter is that it would be familiar to folks coming from using ag. Like you said, search-zip doesn't really make a lot of sense here. Should we just go with the former option?

The completion-function syntax takes a while to get used to, yeah. Every few weeks i learn that something i used to do in these files is wrong, lol.

I like -z for the short option. It's what ag and sift both use, and the connection with 'Ziv' is immediately intuitive. As far as the long option, idk. --search-zip feels slightly inaccurate, but you're right about ag — it's one of the most common recursive file searchers, and definitely the most well known one that supports searching compressed files, so it's not unreasonable to borrow the terminology. --search-compressed makes sense too tho, w/e 🤷‍♀️

okdana · 2018-01-14T12:41:42Z

doc/rg.1

+.RS
+.PP
+Note that rg expects to find the decompression binaries for the
+respective formats on your system\[aq]s PATH for use with this flag.


'On your PATH' feels strange to me — 'in your PATH'?

okdana · 2018-01-14T12:42:31Z

src/app.rs

+        doc!(h, "search-zip",
+             "Search through compressed files.",
+             "Search through compressed files. Currently gz, bz2, xz, and \
+              lzma files are supported. This options expects the uncompression \


Same comment about 'on your PATH', but also i think 'uncompression' should be 'decompression' here

okdana · 2018-01-14T12:48:58Z

There is a problem with the completion function, and i added a few bike-shed wording remarks. I won't comment on the Rust stuff except to say that it does at least make sense to me.

I tested the patch on macOS (just briefly) and it works! This is neat as hell, ty!

tamird · 2018-01-17T19:02:02Z

src/decompressor.rs

+}
+
+lazy_static! {
+    static ref DECOMPRESSION_COMMANDS: HashMap<&'static str, DecompressionCommand> = {


why isn't this an enum?

better yet, can't this stuff be done without spawning subprocesses? i imagine there are rust libraries for most of these.

The rationale for this implementation is explained in #225 and #539

There are still very few pure-Rust (de)compression libraries AFAIK

BurntSushi

@balajisivaraman Thanks so much for working on this! I really wanted to get this merged and there were some fiddly details to get right here, so I went ahead and based a new PR on your work here: #767. I hope that's OK. Most of your code made it through :-) but I did need to refactor the get_reader method (which is now DecompressionReader::from_path).

I left some comments on this PR that roughly correspond to the changes I made.

BurntSushi · 2018-01-30T02:29:36Z

src/decompressor.rs

+use std::process::Command;
+use std::process::Stdio;
+use std::process::ChildStdout;
+use globset::{Glob, GlobSet, GlobSetBuilder};


Imports should be consistent with the rest of the code. They should be grouped in three blocks, where each block is separated by whitespace. The first block are std imports. The second block are external crate imports. The third block are internal crate imports. Each block should be sorted lexicographically.

BurntSushi · 2018-01-30T02:30:07Z

src/decompressor.rs

+/// necessary CLI args
+struct DecompressionCommand {
+    cmd: &'static str,
+    args: Vec<&'static str>


Since we are building these values in a lazy_static!, we don't actually need the Vec. We can use an &'static [&'static str] instead.

BurntSushi · 2018-01-30T02:31:23Z

src/decompressor.rs

+pub fn get_reader(path: &Path, no_messages: bool) -> Option<ChildStdout> {
+    if is_tar_archive(path) {
+        debug!("Tar archives are currently unsupported: {}", path.display());
+        None


In general, I like to keep control flow straight-forward and avoid gratuitous nesting. Early returns are very useful for this. For example, this code could be written as:

if is_tar_archive(path) { debug!(...); return None; } let extension = ...;

That way, you don't need to indent the entire rest of the function.

Ah, that makes a lot of sense. I did do this in the worker.rs file but missed it here. 👍

BurntSushi · 2018-01-30T02:38:25Z

src/decompressor.rs

+                                  String::from_utf8_lossy(error.as_slice()));
+                    }
+                    None
+                }


OK, so the issue here is that this will deadlock in non-trivial cases. You might not have hit it if you only tested on small files. Basically the issue here is that if you don't read from stdout, then the process you spawned (if it's implemented sanely) will at some point block on writing to stdout because it is connected to a pipe. If the calling process never reads from stdout, then the spawned process will eventually block. If the calling process is in turn blocked on something else from the spawned process---which in this case, we are, because we're trying to read from stderr until the spawned process stops writing to it---then you wind up with a deadlock.

If you only test this on small files then you won't notice this because the spawned process will likely be using buffered output. And since it's connected to a pipe, the buffer is usually biggish (~4K?) as opposed to being line buffered on a tty. As a result, the spawned process will finish doing its thing before ever flushing its buffer, which gives the appearance of code that works. :-)

The "fix" to this is to couple the child with reading stdout and wait to read stderr until the process terminates. This basically means that you can't return ChildStdout directly. You need to return something else that wraps up this logic.

As an additional upside, the reader that you return can return normal error values, which are handled by the worker. This means you don't need to thread no_messages through the code.

Ah, I knew it. 😞 When I made this change and looked at the code, I knew there had to be something I was missing.

Thank you for the detailed write up on the issues with my code. It makes a lot of sense to me and helps me understand it better. It also gives me incentive to do some more reading up on pipes and child processes, which I will. 👍

BurntSushi · 2018-01-30T02:39:46Z

tests/tests.rs

+    let err = String::from_utf8_lossy(&output.stderr);
+    assert_eq!(err.contains("not in gzip format"), true);
+}
+


Fantastic tests, thank you. :-)

BurntSushi · 2018-01-30T14:14:21Z

I merged this PR in #767. Thanks again @balajisivaraman!

balajisivaraman · 2018-01-30T15:28:58Z

@BurntSushi, Thank you for merging this. It is one of my first significant contributions of any kind to a OSS project. I learned a lot from working on it as well as going through the changes you made on top of it. Cheers. :-)

BurntSushi · 2018-01-30T15:35:58Z

@balajisivaraman It's a helluva first contribution! It will be a highlight feature in the next release. :)

balajisivaraman changed the title ~~[WIP] Add support for compressed file formats~~ [WIP] Add support for compressed file formats (closes #539) Jan 14, 2018

okdana reviewed Jan 14, 2018

View reviewed changes

tamird reviewed Jan 17, 2018

View reviewed changes

BurntSushi mentioned this pull request Jan 29, 2018

add decompression support on Unix systems #539

Closed

Add support for compressed file formats

de3c935

BurntSushi mentioned this pull request Jan 30, 2018

search: add support for searching compressed files #767

Merged

BurntSushi reviewed Jan 30, 2018

View reviewed changes

BurntSushi mentioned this pull request Jan 30, 2018

support searching compressed files using in-process decompression #225

Closed

BurntSushi closed this Jan 30, 2018

balajisivaraman deleted the fix_539 branch January 30, 2018 15:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Add support for compressed file formats (closes #539) #751

[WIP] Add support for compressed file formats (closes #539) #751

balajisivaraman commented Jan 14, 2018

okdana Jan 14, 2018

balajisivaraman Jan 14, 2018

okdana Jan 14, 2018

okdana Jan 14, 2018

okdana Jan 14, 2018

okdana commented Jan 14, 2018

tamird Jan 17, 2018

okdana Jan 17, 2018

BurntSushi left a comment •

edited

Loading

BurntSushi Jan 30, 2018

BurntSushi Jan 30, 2018

BurntSushi Jan 30, 2018

balajisivaraman Jan 30, 2018

BurntSushi Jan 30, 2018

BurntSushi Jan 30, 2018

balajisivaraman Jan 30, 2018

BurntSushi Jan 30, 2018

BurntSushi commented Jan 30, 2018

balajisivaraman commented Jan 30, 2018

BurntSushi commented Jan 30, 2018

[WIP] Add support for compressed file formats (closes #539) #751

[WIP] Add support for compressed file formats (closes #539) #751

Conversation

balajisivaraman commented Jan 14, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

okdana commented Jan 14, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BurntSushi left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BurntSushi commented Jan 30, 2018

balajisivaraman commented Jan 30, 2018

BurntSushi commented Jan 30, 2018

BurntSushi left a comment •

edited

Loading