
FemtoZipCompressionModel.encodeSubstring encodes a four-byte offset #5

Open
ehrmann opened this issue Dec 14, 2012 · 3 comments

@ehrmann
Contributor

ehrmann commented Dec 14, 2012

Why does this method encode a four-byte offset when offset is only two bytes?

public void encodeSubstring(int offset, int length, Object context) {
        ...
        offset = -offset;
        if (offset < 1 || offset > (2<<15)-1) {
            throw new IllegalArgumentException("Offset " + offset + " out of range [1, 65535]");
        }
        encoder.encodeSymbol(offset & 0xf);
        encoder.encodeSymbol((offset >> 4) & 0xf);
        encoder.encodeSymbol((offset >> 8) & 0xf);
        encoder.encodeSymbol((offset >> 12) & 0xf);
        ...
}
@gtoubassi
Owner

Hi Dave! Thanks for the question. It is encoding the 2-byte offset as 4 symbols, one per nibble (half byte). The encoder creates a Huffman coding table for each nibble, and it turns out that scheme is more efficient (at least as experimentally determined by me) than encoding it as 2 single-byte values, each with its own Huffman table. I believe the reason for this is that most of the length values fit in a nibble, so separating out that high nibble, which is often zero, is high value.

So in short, the values are not written out directly; they are written as symbols compressed with Huffman coding, and you have to decide how to map an integer onto a Huffman encoding: do you have 1 Huffman table for the entire 4 bytes? Break it into 2 bytes and have 2 tables, 1 for each part? Break it into 4 parts and have 4 tables?
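For illustration, here is a hypothetical standalone sketch (class and method names are made up, not from the femtozip source) of the nibble split that encodeSubstring performs, along with the reassembly a decoder would do:

```java
// Hypothetical sketch: split a 16-bit offset into four nibble symbols,
// least significant first, as encodeSubstring does, then reassemble.
public class NibbleSplit {
    // Emit the four nibbles of a 16-bit value, low nibble first.
    static int[] toNibbles(int offset) {
        return new int[] {
            offset & 0xf,
            (offset >> 4) & 0xf,
            (offset >> 8) & 0xf,
            (offset >> 12) & 0xf
        };
    }

    // Inverse: rebuild the offset from its four nibble symbols.
    static int fromNibbles(int[] n) {
        return n[0] | (n[1] << 4) | (n[2] << 8) | (n[3] << 12);
    }

    public static void main(String[] args) {
        int offset = 300; // 0x012C
        int[] nibbles = toNibbles(offset);
        System.out.println(java.util.Arrays.toString(nibbles)); // [12, 2, 1, 0]
        System.out.println(fromNibbles(nibbles));               // 300
    }
}
```

Note how the two high nibbles are both zero for a small offset like 300; with one Huffman table per nibble position, those frequent zeros get very short codes.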

Please let me know if that doesn't make sense and I can explain further.

@ehrmann
Contributor Author

ehrmann commented Dec 17, 2012

I think that mostly makes sense, but I'm still a little surprised that Huffman encoding beats a varint.

Have you gotten a chance to look at some of my pull requests? I think I had a few performance tweaks, a build fix, and a bug fix for the Java version.

@gtoubassi
Owner

I assume when you say varint you are referring to an ad hoc encoding that tries to encode a small number as a single byte, and then uses a sentinel value to indicate that the subsequent bytes represent the full number? Is there a specific implementation you have in mind that I could contemplate? One thing to keep in mind is that the numbers being encoded are often very small (like < 16). So the basic question is: what is the best variable-length encoding for numbers that are usually very, very small but can range up to 65k? Huffman encoding should always beat a varint, since it's a similar encoding but optimized for the data set at hand. In fact, assuming you have suitable sample data to generate the model, the Huffman coding is provably optimal for a symbol-to-symbol encoding (vs. say arithmetic coding, which I found empirically to not be much better than Huffman coding, and which comes with significant performance cost and lots of IP FUD).
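As a point of comparison, here is a sketch of one common varint scheme (the base-128 encoding used by, e.g., Protocol Buffers; the class name is hypothetical). The key limitation for this use case is its one-byte floor: even an offset of 5 costs a full byte, whereas per-nibble Huffman codes can spend only a few bits on the most frequent values.

```java
// Illustrative base-128 varint: 7 payload bits per byte, with the high bit
// set on every byte except the last. Small values take one byte; values up
// to 65535 take at most three.
public class Varint {
    static byte[] encode(int value) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while ((value & ~0x7f) != 0) {
            out.write((value & 0x7f) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode(5).length);     // 1
        System.out.println(encode(300).length);   // 2
        System.out.println(encode(65535).length); // 3
    }
}
```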

Perhaps what's more surprising is that I found breaking the "short" into 4 encoded nibbles was best.

Regarding your pull requests: all the changes look good to me, although I was wondering about the introduction of the "Match" object (which I commented on). I'm inclined to believe you, but for such a change I want to know there was something empirical backing it.
