finding zero bytes in utf-16 encoded files #1207

LesnyRumcajs · 2019-02-28T19:42:09Z

What version of ripgrep are you using?

ripgrep 0.10.0
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

cargo install ripgrep

What operating system are you using ripgrep on?

Fedora 29

Describe your question, feature request, or bug.

I'm struggling to find files that contain 00 bytes. I created a UTF-16 LE file containing the text test (hexdump):

00000000: fffe 7400 6500 7300 7400 0a00            ..t.e.s.t...

Given that there are 00 bytes inside, I ran rg -cuuu '(?-u:\x00)' but got no results at all. It does work when searching for a non-zero byte such as s (\x73):

rg -cuuu '(?-u:\x73)'
test.txt:1

From my understanding, the -uuu flag along with some UTF escaping should do the trick. It works fine for non-zero bytes. It also works for binary files (tried with a file consisting of a single 00 byte). Am I missing something?

lespea · 2019-02-28T19:51:04Z

I think ripgrep is seeing the BOM and properly decoding the test as utf-16 (hence no null bytes). Try using the --no-encoding flag?

    -E, --encoding <ENCODING>
            Specify the text encoding that ripgrep will use on all files searched. The
            default value is 'auto', which will cause ripgrep to do a best effort automatic
            detection of encoding on a per-file basis. Automatic detection in this case
            only applies to files that begin with a UTF-8 or UTF-16 byte-order mark (BOM).
            No other automatic detection is performed.

            Other supported values can be found in the list of labels here:
            https://encoding.spec.whatwg.org/#concept-encoding-get

            For more details on encoding and how ripgrep deals with it, see GUIDE.md.

            This flag can be disabled with --no-encoding.

LesnyRumcajs · 2019-02-28T19:57:53Z

@lespea I'm pretty sure that's the case (removing the BOM makes the command find the bytes, since the file is then treated as plain binary). I don't know how to get around it though. Using --no-encoding, as in rg --no-encoding -cuuu '(?-u:\x00)', still doesn't detect null bytes in the UTF-16 LE file.

BurntSushi · 2019-02-28T21:18:50Z

Interesting issue. The -u flags are superfluous in this case, sans the last one, which you can just replace with -a. The (?-u:\x00) can always just be written as \x00 since codepoint 0 corresponds to byte 0. The --no-encoding flag is also a red herring here, since all that does is disable the use of an --encoding flag, e.g., by resetting it back to auto.

This happens because ripgrep is indeed detecting the BOM and transcoding your UTF-16 to UTF-8, which gets rid of all NUL bytes in this case. ripgrep does not expose any options to override this behavior. Even if you set -E utf8, the BOM still takes precedence because ripgrep enables the transcoder's BOM override option. That option is enabled because the BOM is a super strong indicator of the encoding of the text file, so even if you specify UTF-8, it's still good behavior to switch to UTF-16 automatically when necessary.
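To make the effect of transcoding concrete, here is a minimal sketch that uses only Rust's standard library. It is not ripgrep's actual code path (which goes through encoding_rs_io), and the helper names are made up for illustration. It builds the same bytes as the hexdump in the report and shows that the NUL bytes disappear once the text is converted to UTF-8.

```rust
/// Encode `text` as UTF-16 LE, prefixed with the BOM bytes FF FE.
/// (Hypothetical helper for illustration; not part of ripgrep.)
fn utf16le_with_bom(text: &str) -> Vec<u8> {
    let mut bytes = vec![0xFF, 0xFE];
    for unit in text.encode_utf16() {
        bytes.extend_from_slice(&unit.to_le_bytes());
    }
    bytes
}

/// Decode UTF-16 LE bytes (after an optional BOM) into a UTF-8 `String`.
/// This approximates what a transcoding step does before the regex runs.
/// Returns `None` for an odd byte count or invalid UTF-16.
fn transcode_utf16le_to_utf8(bytes: &[u8]) -> Option<String> {
    let body = if bytes.starts_with(b"\xFF\xFE") {
        &bytes[2..]
    } else {
        bytes
    };
    if body.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

/// Count the NUL (0x00) bytes in a buffer.
fn count_nul(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| b == 0).count()
}
```

Calling `utf16le_with_bom("test\n")` gives a buffer with NUL bytes in it, while the bytes of the string returned by `transcode_utf16le_to_utf8` contain none, which is why a search for \x00 finds nothing after transcoding.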

I suspect the way to fix this is to allow one to specify -E none such that ripgrep never does any transcoding at all.

The only work-around available to you at the moment, as far as I know, is stripping the BOM:

[andrew@Cheetah rg1207]$ xxd foo.utf16le
00000000: 7400 6500 7300 7400 0a00                 t.e.s.t...
[andrew@Cheetah rg1207]$ rg '\x00' foo.utf16le -a
1:test
2:
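The transcript above shows the file after the BOM was removed, but not the removal itself. One possible way to do that step, sketched in Rust with only the standard library (the function names are hypothetical, and any tool that drops the first two bytes would work just as well):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Return `bytes` without a leading UTF-16 BOM (FF FE or FE FF), if any.
fn strip_utf16_bom(bytes: &[u8]) -> &[u8] {
    if bytes.starts_with(b"\xFF\xFE") || bytes.starts_with(b"\xFE\xFF") {
        &bytes[2..]
    } else {
        bytes
    }
}

/// Copy `src` to `dst`, dropping a leading UTF-16 BOM if one is present.
/// The whole file is read into memory, which is fine for small inputs.
fn write_without_bom(src: &Path, dst: &Path) -> io::Result<()> {
    let data = fs::read(src)?;
    fs::write(dst, strip_utf16_bom(&data))
}
```

With the BOM gone, ripgrep 0.10.0 no longer transcodes the file, so `rg -a '\x00'` sees the raw NUL bytes.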

LesnyRumcajs · 2019-02-28T21:42:29Z

@BurntSushi Thanks for explaining. I understand my case is a little bit outside the normal usage of ripgrep - I'm processing text files that may be corrupted (UTF8, UTF16 ... or a mix of them, don't ask). Anyway, I think -E none or a 4th u to further reduce ripgrep's smartness would be a nice feature for such corner cases.

BurntSushi · 2019-02-28T22:07:30Z

Yes, I agree. ripgrep should be able to handle this use case. There should be a way to override transcoding so that you can treat even completely valid UTF-16 as arbitrary bytes.

LesnyRumcajs · 2019-03-01T09:05:13Z

@BurntSushi Do you consider it a good first task? Or would it require some major rewrite? I'd like to contribute.

BurntSushi · 2019-03-01T12:39:58Z

@LesnyRumcajs Ah yes, great idea! This is probably a decent first task, although there is a fair bit of plumbing. The high level idea is to add support for completely disabling transcoding support. This change will require changes to three different crates. (Such is the cost for splitting ripgrep's functionality out so that others can use it.)

encoding_rs_io's DecodeReaderBytesBuilder should get a new option, probably called bom_sniffing, that is enabled by default but can be disabled. When disabled, BOMs are always ignored, but if an encoding was given, then that encoding is still used. You'll need to add support for this new option in the implementation, and at least one test. This will need to be a separate PR since encoding_rs_io lives in its own repository. I can cut a new release once this PR is merged. This is the most interesting part of this task; the rest is strictly plumbing.
Next, grep-searcher's SearcherBuilder needs to provide its own bom_sniffing option that forwards its value to DecodeReaderBytesBuilder.
You'll want to modify ripgrep core itself to make use of this new bom_sniffing option. One way of doing this is modifying the function that interprets which encoding to use from the command line arguments. I'd probably do this by replacing the return type, Result<Option<Encoding>> with Result<EncodingMode>, where EncodingMode is a new type defined like so:

#[derive(Debug)]
enum EncodingMode {
    // Use an explicit encoding forcefully, but let BOM sniffing override it.
    Some(Encoding),
    // Use only BOM sniffing to auto-detect an encoding.
    Auto,
    // Use no explicit encoding and disable all BOM sniffing. This will
    // always result in searching the raw bytes, regardless of their
    // true encoding.
    Disabled,
}

You'll need to add support for a new none value for the encoding flag. This should be documented in the encoding flag's docs and added to zsh's auto completion list.

At this point, you'll get some compiler errors because the callers of the encoding method need to be updated. Specifically, you need to change the SearcherBuilder configuration to use your new type, e.g., setting bom_sniffing(false) when you have EncodingMode::Disabled. You'll also need to update PCRE2's use of the encoding() method. Right now, it disables the UTF-8 check only when an explicit encoding is specified. So you should add a new predicate function to EncodingMode, e.g., has_explicit_encoding, that can replace the use of is_some(). In all other cases, PCRE2 needs to keep its UTF-8 check.

Finally, add a new test covering this feature to ripgrep's integration tests for new features. If you need help with this let me know, but I think just following some of the pattern of other examples should be good.

Writing all that out makes it seem like a fair bit of work, but I think it's doable!
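As a rough illustration of the plumbing described above, here is a self-contained sketch of EncodingMode with the suggested has_explicit_encoding predicate. Several things are assumptions rather than ripgrep's real code: the Encoding struct is a stand-in for grep-searcher's type, from_flag and bom_sniffing are hypothetical helper names, and the real function returns a Result and validates the encoding label, which this sketch skips.

```rust
/// Stand-in for grep-searcher's `Encoding` type so that the sketch
/// compiles on its own. The real type validates the label.
#[derive(Debug, Clone, PartialEq)]
struct Encoding(String);

#[derive(Debug, Clone, PartialEq)]
enum EncodingMode {
    // Use an explicit encoding forcefully, but let BOM sniffing override it.
    Some(Encoding),
    // Use only BOM sniffing to auto-detect an encoding.
    Auto,
    // Use no explicit encoding and disable all BOM sniffing.
    Disabled,
}

impl EncodingMode {
    /// Interpret the value given to -E/--encoding (`None` if the flag is
    /// absent). Hypothetical helper: no label validation is done here.
    fn from_flag(value: Option<&str>) -> EncodingMode {
        match value {
            None | Some("auto") => EncodingMode::Auto,
            Some("none") => EncodingMode::Disabled,
            Some(label) => EncodingMode::Some(Encoding(label.to_string())),
        }
    }

    /// True only when the user named a concrete encoding. This is the
    /// predicate meant to replace `is_some()` in the PCRE2 setup.
    fn has_explicit_encoding(&self) -> bool {
        matches!(self, EncodingMode::Some(_))
    }

    /// The value one would forward to `SearcherBuilder::bom_sniffing`.
    fn bom_sniffing(&self) -> bool {
        !matches!(self, EncodingMode::Disabled)
    }
}
```

Keeping the three cases in one enum means each caller has to decide what to do for Disabled, instead of silently treating it like the absence of an encoding.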

LesnyRumcajs · 2019-03-01T17:04:36Z

@BurntSushi Woah, thanks a lot for the hints - you saved me several hours of grinding my teeth and potential PR rejects! I'll get to it.

This makes it possible to use the transcoder to pass through its bytes unconditionally without any transcoding. This is the same as not using it at all, but makes consumer code organization a bit simpler if this is linked back to a runtime configuration option. This addresses part of the work toward completing BurntSushi/ripgrep#1207

This commit adds a new encoding feature where the -E/--encoding flag will now accept a value of 'none'. When given this value, all encoding related machinery is disabled and ripgrep will search the raw bytes of the file, including the BOM if it's present. Closes #1207, Closes #1208
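The precedence rules discussed in this thread (a BOM overrides an explicit encoding, and disabling BOM sniffing leaves only the explicit encoding, if any) can be summarized in a small decision function. This is a sketch of the described behavior using only the standard library. The Enc enum and both function names are invented for illustration and do not mirror the encoding_rs_io or grep-searcher APIs.

```rust
/// The encodings that a BOM can identify. Illustrative type only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Enc {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Detect a UTF-8 or UTF-16 byte-order mark at the start of `bytes`.
fn sniff_bom(bytes: &[u8]) -> Option<Enc> {
    if bytes.starts_with(b"\xEF\xBB\xBF") {
        Some(Enc::Utf8)
    } else if bytes.starts_with(b"\xFF\xFE") {
        Some(Enc::Utf16Le)
    } else if bytes.starts_with(b"\xFE\xFF") {
        Some(Enc::Utf16Be)
    } else {
        None
    }
}

/// Decide how the file would be interpreted. `None` means "search the
/// raw bytes without transcoding".
fn effective_encoding(
    explicit: Option<Enc>,
    bom_sniffing: bool,
    bytes: &[u8],
) -> Option<Enc> {
    if bom_sniffing {
        // A BOM, when present, wins over the explicit encoding.
        sniff_bom(bytes).or(explicit)
    } else {
        // BOMs are ignored entirely; only an explicit encoding applies.
        explicit
    }
}
```

Under this model, -E none corresponds to `effective_encoding(None, false, bytes)`, which always yields `None`, so the NUL bytes and the BOM itself stay visible to the search.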

BurntSushi added the enhancement label Feb 28, 2019

LesnyRumcajs mentioned this issue Mar 3, 2019

Add bom_sniffing option to DecodeReaderBytesBuilder BurntSushi/encoding_rs_io#5

Merged

BurntSushi added a commit that referenced this issue Mar 3, 2019

deps: bump encoding_rs_io

0913972

This brings in a new API for disabling BOM sniffing. This is part of the work toward completing #1207

LesnyRumcajs mentioned this issue Mar 4, 2019

add option to disable bom sniffing #1208

Closed

xtaran mentioned this issue Mar 26, 2019

Exits immediately without warning if it encounters a NUL byte inside the file to be searched, might exit with wrong exit code depending on the position of the match #1227

Closed

BurntSushi mentioned this issue Apr 6, 2019

searcher: add option to disable BOM sniffing #1237

Merged

BurntSushi closed this as completed in #1237 Apr 6, 2019
