finding zero bytes in utf-16 encoded files #1207

Closed
LesnyRumcajs opened this issue Feb 28, 2019 · 8 comments
Labels
enhancement An enhancement to the functionality of the software.

Comments

@LesnyRumcajs
Contributor

What version of ripgrep are you using?

ripgrep 0.10.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

cargo install ripgrep

What operating system are you using ripgrep on?

Fedora 29

Describe your question, feature request, or bug.

I'm struggling to find files that contain 00 bytes. I created a UTF-16 LE file containing the text test (hexdump below):

00000000: fffe 7400 6500 7300 7400 0a00            ..t.e.s.t...

Since the file contains 00 bytes, I ran rg -cuuu '(?-u:\x00)' but got no results at all. Searching for a non-zero byte such as s (0x73) works:

rg -cuuu '(?-u:\x73)'
test.txt:1

From my understanding, the -uuu flag along with some UTF escaping should do the trick. It works fine for non-zero bytes. It also works for binary files (I tried a file consisting of a single 00 byte). Am I missing something?

@lespea

lespea commented Feb 28, 2019

I think ripgrep is seeing the BOM and properly decoding the test as utf-16 (hence no null bytes). Try using the --no-encoding flag?

    -E, --encoding <ENCODING>
            Specify the text encoding that ripgrep will use on all files searched. The
            default value is 'auto', which will cause ripgrep to do a best effort automatic
            detection of encoding on a per-file basis. Automatic detection in this case
            only applies to files that begin with a UTF-8 or UTF-16 byte-order mark (BOM).
            No other automatic detection is performed.

            Other supported values can be found in the list of labels here:
            https://encoding.spec.whatwg.org/#concept-encoding-get

            For more details on encoding and how ripgrep deals with it, see GUIDE.md.

            This flag can be disabled with --no-encoding.

@LesnyRumcajs
Contributor Author

@lespea I'm pretty sure that's the case (removing the BOM makes the command find the bytes, since ripgrep then treats the file as plain binary). I don't know how to get around it, though. Using --no-encoding, as in rg --no-encoding -cuuu '(?-u:\x00)', still doesn't detect null bytes in the UTF-16 LE file.

@BurntSushi
Owner

Interesting issue. The -u flags are superfluous in this case, sans the last one, which you can just replace with -a. The (?-u:\x00) can always just be written as \x00 since codepoint 0 corresponds to byte 0. The --no-encoding flag is also a red herring here, since all that does is disable the use of an --encoding flag, e.g., by resetting it back to auto.

The reason why this is happening is because ripgrep is indeed detecting the BOM and transcoding your UTF-16 to UTF-8, which gets rid of all NUL bytes in this case. ripgrep does not expose any options to override this behavior. Even if you set -E utf8, the BOM still takes precedence because ripgrep enables this option. This option is enabled because the BOM is a super strong indicator of the encoding of the text file, so even if you specify UTF-8, it's still good behavior to switch to UTF-16 automatically when necessary.
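To make the effect concrete, here is a minimal sketch of what transcoding does to the file from the report. It uses only the Rust standard library and a hypothetical helper named utf16le_to_utf8; it is not ripgrep's actual transcoding path, just an illustration of why the NUL bytes are gone once the text is UTF-8.

```rust
/// Hypothetical helper: decode UTF-16LE bytes (with an optional leading
/// FF FE byte-order mark) and re-encode the text as UTF-8.
fn utf16le_to_utf8(bytes: &[u8]) -> Option<Vec<u8>> {
    // Drop the BOM if it is present.
    let body = if bytes.starts_with(&[0xFF, 0xFE]) {
        &bytes[2..]
    } else {
        bytes
    };
    // UTF-16 code units are two bytes each.
    if body.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok().map(String::into_bytes)
}

fn main() {
    // The bytes from the hexdump in the report: BOM, then "test\n".
    let raw: [u8; 12] = [
        0xFF, 0xFE, 0x74, 0x00, 0x65, 0x00, 0x73, 0x00, 0x74, 0x00, 0x0A, 0x00,
    ];
    let utf8 = utf16le_to_utf8(&raw).expect("input should be valid UTF-16LE");

    println!("raw bytes contain NUL:        {}", raw.contains(&0));
    println!("transcoded bytes contain NUL: {}", utf8.contains(&0));
    assert_eq!(utf8, b"test\n".to_vec());
}
```

The raw file has a 00 byte after every ASCII character, but the UTF-8 form of the same text is just the five bytes of "test\n", so a search for \x00 on the transcoded text has nothing to match.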

I suspect the way to fix this is to allow one to specify -E none such that ripgrep never does any transcoding at all.

The only work-around available to you at the moment, as far as I know, is stripping the BOM:

[andrew@Cheetah rg1207]$ xxd foo.utf16le
00000000: 7400 6500 7300 7400 0a00                 t.e.s.t...
[andrew@Cheetah rg1207]$ rg '\x00' foo.utf16le -a
1:test
2:
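If stripping the BOM by hand is inconvenient, the work-around could be scripted. The following is a rough sketch of a tiny standalone tool, assuming the input starts with a UTF-16 BOM (FF FE or FE FF); the function name strip_utf16_bom and the command-line shape are made up for illustration and are not part of ripgrep.

```rust
use std::{env, fs, io};

/// Hypothetical helper: return `bytes` without a leading UTF-16 BOM
/// (FF FE for little endian, FE FF for big endian), if one is present.
fn strip_utf16_bom(bytes: &[u8]) -> &[u8] {
    if bytes.starts_with(&[0xFF, 0xFE]) || bytes.starts_with(&[0xFE, 0xFF]) {
        &bytes[2..]
    } else {
        bytes
    }
}

fn main() -> io::Result<()> {
    // Usage (hypothetical): strip-bom <input> <output>
    let args: Vec<String> = env::args().collect();
    if args.len() != 3 {
        eprintln!("usage: {} <input> <output>", args[0]);
        return Ok(());
    }
    let data = fs::read(&args[1])?;
    // Write everything except the BOM; the remaining bytes are untouched.
    fs::write(&args[2], strip_utf16_bom(&data))?;
    Ok(())
}
```

Searching the output file with rg -a '\x00' should then behave like the foo.utf16le example above, since no BOM is left to trigger transcoding.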

@BurntSushi BurntSushi added the enhancement An enhancement to the functionality of the software. label Feb 28, 2019
@LesnyRumcajs
Contributor Author

@BurntSushi Thanks for explaining. I understand my case is a little outside the normal usage of ripgrep: I'm processing text files that may be corrupted (UTF-8, UTF-16 ... or a mix of them, don't ask). Anyway, I think -E none or a 4th u to further reduce ripgrep's smartness would be a nice feature for such corner cases.

@BurntSushi
Owner

Yes, I agree. ripgrep should be able to handle this use case. There should be a way to override transcoding so that you can treat even completely valid UTF-16 as arbitrary bytes.

@LesnyRumcajs
Contributor Author

@BurntSushi Do you consider it a good first task? Or would it require some major rewrite? I'd like to contribute.

@BurntSushi
Owner

BurntSushi commented Mar 1, 2019

@LesnyRumcajs Ah yes, great idea! This is probably a decent first task, although there is a fair bit of plumbing. The high level idea is to add support for completely disabling transcoding support. This change will require changes to three different crates. (Such is the cost for splitting ripgrep's functionality out so that others can use it.)

#[derive(Debug)]
enum EncodingMode {
    // Use an explicit encoding forcefully, but let BOM sniffing override it.
    Some(Encoding),
    // Use only BOM sniffing to auto-detect an encoding.
    Auto,
    // Use no explicit encoding and disable all BOM sniffing. This will
    // always result in searching the raw bytes, regardless of their
    // true encoding.
    Disabled,
}

You'll need to add support for a new none value for the encoding flag. This should be documented in the encoding flag's docs and added to zsh's auto-completion list.

At this point, you'll get some compiler errors because the callers of the encoding method need to be updated. Specifically, you need to change the SearcherBuilder configuration to use your new type, e.g., setting bom_sniffing(false) when you have EncodingMode::Disabled. You'll also need to update PCRE2's use of the encoding() method. Right now, it disables the UTF-8 check only when an explicit encoding is specified. So you should add a new predicate function to EncodingMode, e.g., has_explicit_encoding, that can replace the use of is_some(). In all other cases, PCRE2 needs to keep its UTF-8 check.
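As a rough sketch of the predicate described above (not the code that was eventually merged), the enum could carry the has_explicit_encoding method like this. The Encoding struct here is a stand-in so the example compiles on its own, and the bom_sniffing and label helpers are hypothetical names used only for illustration.

```rust
/// Stand-in for the real encoding type; hypothetical, for illustration only.
#[derive(Debug)]
struct Encoding(&'static str);

#[derive(Debug)]
enum EncodingMode {
    // Use an explicit encoding forcefully, but let BOM sniffing override it.
    Some(Encoding),
    // Use only BOM sniffing to auto-detect an encoding.
    Auto,
    // Use no explicit encoding and disable all BOM sniffing.
    Disabled,
}

impl EncodingMode {
    /// True only when the user named a concrete encoding. Per the comment
    /// above, this is the one case where PCRE2 skips its UTF-8 check.
    fn has_explicit_encoding(&self) -> bool {
        matches!(self, EncodingMode::Some(_))
    }

    /// Hypothetical helper: whether BOM sniffing should stay enabled.
    fn bom_sniffing(&self) -> bool {
        !matches!(self, EncodingMode::Disabled)
    }

    /// Hypothetical helper: the flag value this mode corresponds to.
    fn label(&self) -> &'static str {
        match self {
            EncodingMode::Some(enc) => enc.0,
            EncodingMode::Auto => "auto",
            EncodingMode::Disabled => "none",
        }
    }
}

fn main() {
    let modes = [
        EncodingMode::Some(Encoding("utf-16le")),
        EncodingMode::Auto,
        EncodingMode::Disabled,
    ];
    for mode in modes {
        println!(
            "-E {}: explicit={}, bom_sniffing={}",
            mode.label(),
            mode.has_explicit_encoding(),
            mode.bom_sniffing()
        );
    }
}
```

Keeping the decision in small methods on EncodingMode means callers such as the searcher setup and the PCRE2 matcher setup do not need to match on the enum themselves.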

Finally, add a new test covering this feature to ripgrep's integration tests for new features. If you need help with this let me know, but I think just following some of the pattern of other examples should be good.

Writing all that out makes it seem like a fair bit of work, but I think it's doable!

@LesnyRumcajs
Contributor Author

@BurntSushi Woah, thanks a lot for the hints - you saved me several hours of grinding my teeth and potential PR rejects! I'll get to it.

BurntSushi pushed a commit to BurntSushi/encoding_rs_io that referenced this issue Mar 3, 2019
This makes it possible to use the transcoder to pass through its
bytes unconditionally without any transcoding. This is the same as
not using it at all, but makes consumer code organization a bit
simpler if this is linked back to a runtime configuration option.

This addresses part of the work toward completing
BurntSushi/ripgrep#1207
BurntSushi added a commit that referenced this issue Mar 3, 2019
This brings in a new API for disabling BOM sniffing.

This is part of the work toward completing
#1207
BurntSushi pushed a commit that referenced this issue Apr 6, 2019
This commit adds a new encoding feature where the -E/--encoding flag
will now accept a value of 'none'. When given this value, all encoding
related machinery is disabled and ripgrep will search the raw bytes of
the file, including the BOM if it's present.

Closes #1207, Closes #1208