Offset in --json output does not take into account BOM bytes #1627
Comments
Another thing that's similar to this is that when no encoding is passed to Right now my tool uses (I know that
I'm afraid that
The It's unfortunate, but I don't see a simple way to solve this. This is probably a victim of de-coupling. The layer in ripgrep that handles transcoding has zero knowledge of the layer that handles searching and dealing with offsets.
This might be plausible to do, and then you could at least do offset translation at that point. It would be work, but it would be possible. However, I think this is a separate issue and probably requires some careful planning. Would you mind creating a new issue? It would help very much if you could write as much detail as possible about your use case.
It is not actually guessing the encoding. The only "sniffing" it does is BOM detection. BOMs are purportedly very strong indicators of the encoding in the file. If no BOM is present and no explicit encoding is provided by the end user, then ripgrep will always assume an ASCII-compatible (and UTF-8 by convention) encoding. This works decently well for latin-1 for example.
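For readers following along, BOM-only sniffing of the kind described here can be sketched roughly as below. This is an illustration, not ripgrep's actual code: the function name `sniff_bom` and the choice to check only the UTF-8 and UTF-16 BOMs are assumptions made for the example.

```python
import codecs

# BOMs this sketch recognises (an assumption for illustration only).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(head: bytes) -> tuple[str, int]:
    """Return (encoding, bom_length) for the leading bytes of a file.

    With no BOM present, fall back to an ASCII-compatible default
    (UTF-8 by convention) and a BOM length of 0.
    """
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return "utf-8", 0
```

A tool that consumes the JSON output could call something like this on the first few bytes of each file to learn how many BOM bytes precede the content.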
Ah, I'd written some simple tests that only dealt with a single special character that itself took up two bytes in UTF8 as well, so this part slipped by me.
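To make the mismatch concrete, here is a minimal sketch of translating an offset back into the original file. It assumes, as the discussion above suggests, that the reported offset counts bytes in the UTF-8 transcoded text with the BOM removed. The helper name `to_file_offset` is hypothetical, and the sketch further assumes the file decodes cleanly and that the offset falls on a character boundary.

```python
import codecs


def to_file_offset(raw: bytes, utf8_offset: int, encoding: str, bom_len: int) -> int:
    """Map an offset in the UTF-8 transcoded text to an offset in the raw file.

    Assumes `raw` decodes without errors and `utf8_offset` is on a
    character boundary of the transcoded text.
    """
    text = raw[bom_len:].decode(encoding)
    # Text that precedes the match, recovered from the UTF-8 byte offset.
    prefix = text.encode("utf-8")[:utf8_offset].decode("utf-8")
    # Re-measure that prefix in the file's own encoding and add the BOM back.
    return bom_len + len(prefix.encode(encoding))


# A UTF-16LE file with a BOM; "wörld" starts 7 bytes into the UTF-8 text.
raw = codecs.BOM_UTF16_LE + "héllo wörld".encode("utf-16-le")
start = to_file_offset(raw, 7, "utf-16-le", 2)
print(start, raw[start:start + 10].decode("utf-16-le"))
```

Simply adding the BOM length would give 9 here, which lands in the middle of a UTF-16 code unit. Note that this approach re-decodes the whole file, so it is only practical as a fallback for small inputs.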
I guess the simplest way for me to get this working in a somewhat reliable manner, would be to trust the
Here it is: #1629
Probably, assuming you're using the same regex engine (which might either be Rust's regex engine or PCRE2). I'm going to close this given the presence of #1629. I don't think there is much that can be done here unfortunately. Or at least, fixing it would require some rather large implementation work that I'm not particularly keen on doing or supporting.
What version of ripgrep are you using?
How did you install ripgrep?
What operating system are you using ripgrep on?
# Arch Linux
Linux 5.4.43-1-lts x86_64 GNU/Linux
Describe your bug.
When using the --json flag, the absolute_offset provided by rg doesn't take into account Byte Order Marks of the file's encoding.
What are the steps to reproduce the behavior?
Running this:
Produces the following output:
As you can see, the match object here:
Provides an absolute_offset of 0. If we read the range of bytes reported, we get this:
We get the UTF16LE BOM, not the matched portion that rg found:
# Note that the trailing `00` is outputted by `cut`
00000000: fffe 00 ...
What is the expected behavior?
I'm not sure if this is the expected behaviour or not, but I expected that the range provided by rg would map to the precise location of the matched bytes inside the file.
If this is not the case, is there a way we can somehow surface this information? 🤔
I'm making an interactive replacement tool for ripgrep, and right now I have to detect the presence of BOMs myself and increase the absolute_offset accordingly.