SPIR-V module binary size / compression #382

dneto0 opened this Issue Aug 27, 2016 · 9 comments

@dneto0
Contributor
dneto0 commented Aug 27, 2016

We've heard reports that SPIR-V modules are larger than those compiled to other representations.

First, SPIR-V binary encoding is extremely regular and is designed to be very simple to handle. It has lots of redundancy. For example, the SPIRV-Tools binary parser is simple and nearly stateless.

Second, Glslang generates binaries with OpName for many objects (see KhronosGroup/glslang#316). It also doesn't attempt to use group decorations.

To make smaller binaries, we need to make tools smarter: emit less redundant info in the first place, build tools that eliminate redundancy (while still producing valid SPIR-V binaries), and provide semantically lossless compression and decompression.

This issue is a brain dump of a few ideas along these lines. (Keep in mind that the SPIRV-Tools must remain unencumbered, including a possible relicensing under the Apache 2 license.)

Random ideas that leave the result a valid SPIR-V binary:

  • Compilers stop emitting debug info by default (e.g. glslang, glslc). Add an option -g to emit the debug info.
  • Remap Ids, for redundancy removal across many modules (LunarG's spirv-remap)
  • Link shaders together into a single SPIR-V module, to share common declarations (like types), and share helper function bodies.
  • Write transforms for grouping decorations
  • Dead code elimination
  • Constant folding
  • Redundant value elimination (e.g. a global value numbering algorithm)

Generic compression ideas:

  • Since SPIR-V is 32-bit word-oriented, use compression algorithms that work at the word level, e.g. Huffman coding of the distinct words in a module. (A rough feasibility check is sketched below.)
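As a quick feasibility check (a hedged sketch, not part of any existing tool), one could histogram the distinct 32-bit words of a module and compute the Shannon entropy, which bounds what an ideal word-level coder such as Huffman could achieve:

    #include <cmath>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Lower bound, in bits, on the output of an ideal word-level entropy
    // coder (e.g. Huffman) for the given SPIR-V module words.
    double WordEntropyBits(const std::vector<uint32_t>& words) {
      std::unordered_map<uint32_t, size_t> freq;
      for (uint32_t w : words) ++freq[w];
      double bits = 0.0;
      const double n = static_cast<double>(words.size());
      for (const auto& kv : freq) {
        const double p = static_cast<double>(kv.second) / n;
        bits += static_cast<double>(kv.second) * -std::log2(p);
      }
      return bits;  // compare against words.size() * 32 raw bits
    }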

Low level encoding ideas (stateless):

  • Bounded IDs: the ID bound tells you how many bits of a 32-bit word can be non-zero, so you never have to write out upper bytes that are always zero. E.g. if no ID exceeds 255, write only a single byte for each ID in the module. (Of course, this breaks the word orientation.)
  • Bounded integer constants: the same idea, for positive constants. It doesn't work well for negative numbers in two's complement, but fortunately almost all integer literals in a module are unsigned. (Exceptions are the values for OpConstant, OpSpecConstant, and in OpSwitch, if I remember correctly.)
  • Use "varint"-style encoding of integers as in protocol buffers. It's a nice variable-number-of-bytes encoding of integers that gracefully handles negative numbers using a zig-zag encoding (sign-magnitude, where the sign bit is the LSB). Again, knowledge of the SPIR-V grammar tells you when you have an unsigned value, so you can save a bit by not encoding the sign. (See the sketch after this list.)
  • Lots of instructions have type IDs that can be inferred directly from their arguments, assuming valid SPIR-V. Some, like OpIAdd, have a result type that is very restricted, e.g. either the signed or unsigned form of an argument type.
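To make the varint idea concrete, here is a minimal sketch following the usual protocol-buffers conventions (these helpers are illustrations, not an existing SPIRV-Tools API):

    #include <cstdint>
    #include <vector>

    // Varint: 7 payload bits per byte, high bit set means "more bytes follow".
    void EncodeVarint(uint32_t v, std::vector<uint8_t>* out) {
      while (v >= 0x80) {
        out->push_back(static_cast<uint8_t>(v) | 0x80);
        v >>= 7;
      }
      out->push_back(static_cast<uint8_t>(v));
    }

    // Zig-zag maps signed to unsigned so small negatives stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    uint32_t ZigZag(int32_t v) {
      return (static_cast<uint32_t>(v) << 1) ^ static_cast<uint32_t>(v >> 31);
    }

When the grammar says a value is unsigned (IDs, most literals), the zig-zag step is skipped and the sign bit is saved, exactly as described above.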

Stateful encoding:

  • Instead of writing out an ID value as itself, the encoder uses a dynamically updated table of IDs (mirrored in the decoder) to generate smaller values in the emitted binary. For example, an arithmetic instruction will often operate on recently generated values, so you can use a simple move-to-front heuristic to maintain a table of the most recently mentioned IDs. E.g. if the most recently mentioned IDs are %a, %b, %c in that order, a move-to-front heuristic puts %c in slot 0, %b in slot 1, and %a in slot 2. To encode %d = OpIAdd %int %a %b, emit the instruction but use 2 for %a and 1 for %b, and suitably update the table. This works well with the varint encodings. (A sketch appears after this list.)
  • There are several other variants of the previous idea, not just move-to-front.
  • Type IDs can be in their own table, since they have different locality characteristics.
  • (Constants might best get separate treatment.)
  • Don't explicitly write out result IDs; just generate them implicitly. (Requires mirroring of state between encoder and decoder.)
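A minimal sketch of the move-to-front table (a hypothetical helper, maintained identically by encoder and decoder):

    #include <algorithm>
    #include <cstdint>
    #include <deque>

    class IdMtf {
     public:
      // Returns the current index of `id`, or the table size if it's absent
      // (meaning "emit the raw ID instead"), then moves `id` to the front.
      size_t EncodeAndUpdate(uint32_t id) {
        auto it = std::find(table_.begin(), table_.end(), id);
        const size_t index = static_cast<size_t>(it - table_.begin());
        if (it != table_.end()) table_.erase(it);
        table_.push_front(id);
        return index;
      }
     private:
      std::deque<uint32_t> table_;
    };

The small indices this produces pair naturally with the varint encoding sketched earlier.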

Anyway, this is just a start of what we could do.

@aras-p
aras-p commented Aug 27, 2016

Good stuff! In my mind, there are several goals and several approaches here:

Given a SPIR-V program:

  • Make it smaller (stripping debug info, optimizing the instructions themselves, grouping decorations, etc.). Some of that means changing the common tools that produce SPIR-V.
  • Make it more compressible, while still keeping it a valid SPIR-V that does the same thing. This is what spirv-remap does.
  • Transform it into some other (more compact) form that does the same thing but is no longer SPIR-V. This is either to make it just smaller, and/or to make it more compressible (similar to "filters" in compression, e.g. PNG pixel-prediction filters). E.g. spirv-remap can't really do many transforms to aid compression, since it has to stay valid SPIR-V; without this restriction, perhaps much more could be done.

I'm playing around with various approaches for the last item -- the "varint" encoding you mentioned, and I'm also going to try delta-encoding the IDs (in almost all the programs I have here, the majority of IDs are 1-2 away from IDs used in the previous instruction).
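The shape of that delta idea, as a hedged sketch (SMOL-V's real format differs in its details):

    #include <cstdint>

    // Encode an operand ID as a zig-zagged delta from the most recent result
    // ID. Since most operands refer to values defined one or two instructions
    // earlier, the deltas are tiny and varint-encode to a single byte.
    uint32_t EncodeIdDelta(uint32_t operand_id, uint32_t prev_result_id) {
      const int32_t delta = static_cast<int32_t>(prev_result_id) -
                            static_cast<int32_t>(operand_id);
      return (static_cast<uint32_t>(delta) << 1) ^
             static_cast<uint32_t>(delta >> 31);
    }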

I have also seen some fairly unexpected stats from the shaders I have, e.g. OpVectorShuffle takes up a lot of total space -- it's a very verbose encoding (very often 9 words), in most cases representing just a swizzle or a scalar splat. I have to dig in more to find whether there are similar patterns that could be either encoded more compactly, or encoded in a way that's more compressible. (A breakdown of the shuffle encoding follows.)
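For reference, a swizzle like v.xyzx written as OpVectorShuffle occupies 9 words, most of which are predictable; a custom encoding could pack the whole swizzle into one byte (PackSwizzle is a hypothetical helper, not an existing API):

    #include <cstdint>

    // OpVectorShuffle for v.xyzx, word by word:
    //   word 0: opcode + word count (9)
    //   word 1: result type ID
    //   word 2: result ID
    //   word 3: vector 1 ID
    //   word 4: vector 2 ID (often the same as vector 1)
    //   words 5-8: component literals (0, 1, 2, 0)
    // Packing four 2-bit components of a <=4-wide swizzle into one byte:
    uint8_t PackSwizzle(uint8_t x, uint8_t y, uint8_t z, uint8_t w) {
      return static_cast<uint8_t>((x & 3) | ((y & 3) << 2) |
                                  ((z & 3) << 4) | ((w & 3) << 6));
    }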

@Themaister
Themaister commented Aug 27, 2016 edited

With shader variants, you'd probably find that the same basic blocks appear over and over, with maybe just shifted IDs. It could be interesting to have an archive format where SPIR-V files would do something like OpLabelLink $hash $id-shift, and basic blocks could be inlined at runtime before passing to the driver.

@aras-p
aras-p commented Aug 27, 2016 edited

Experimenting with varint encoding + some delta encoding. Results look quite promising. Messy code on github (https://github.com/aras-p/smol-v -- Win/Mac builds)

But, testing on 113 shaders I have right now (caveat emptor: all produced by HLSL -> d3dcompiler -> DX11 bytecode -> HLSLcc -> GLSL -> glslang, so they might have some patterns that aren't "common" elsewhere):

Compression: original size 1314.8KB
0 Remap       1314.1KB  99.9%  glslang spirv-remap
0 SMOL-V       448.3KB  34.1% "my stupid code"

Compressed with LZ4 HC compressor at default settings ("re" = remapper, "sm" = my test):
1    LZ4HC     329.9KB  25.1%
1 re+LZ4HC     241.8KB  18.4%
1 sm+LZ4HC     128.0KB   9.7%

Compressed with Zstd compressor at default settings:
2    Zstd      279.3KB  21.2%
2 re+Zstd      188.7KB  14.4%
2 sm+Zstd      117.4KB   8.9%

Compressed with Zstd compressor at almost max setting (20):
3    Zstd20    187.0KB  14.2%
3 re+Zstd20    129.0KB   9.8%
3 sm+Zstd20     92.0KB   7.0%

There are a lot more instructions I could be encoding (so far I've just looked at the ones taking up the most space), and perhaps other tricks could be done. The shaders I have do have debug names on them; I am not stripping those out.

(edit: updated with August 28 results)

@dneto0
Contributor
dneto0 commented Aug 27, 2016

@aras-p Very promising results! Thanks for sharing.
Agreed: Transforms should be designed to make the result more compressible by standard (universal) compressors.

I also agree that we have to be mindful of using a reasonable tuning set of shaders. Here's a good project for someone: make a public repository of example shaders and define meaningful tuning sets over them.

Other encoding ideas:

  • I wonder how many binary arithmetic operators have one operand that is exactly the previously generated value. That might be very common given how code generators work. In that case, make new opcodes that embed that assumption. E.g. turn

    %a = ...
    ...
    %b = ...
    %sum = OpIAdd %int %a %b

into

    %a = ...
    ...
    %b = ...
    %sum = OpImplicitIAdd %int %a    ; automatically uses %b as the other operand

(This reminds me of writing assembly for the 6502.)
The idea here is that instead of encoding the operand explicitly (even with delta coding), it only bloats the instruction opcode space slightly.

  • Swap the operands of symmetric opcodes to take advantage of the previous idea. E.g. addition is symmetric.
  • Identifier scoping. The IDs generated in a function body (call them local IDs) can only be accessed (as values) in the same function. So throw away those IDs at the end of the function, allowing them to be reused in the next function body. But then you have to rework how annotations work, e.g. allow an annotations section just inside each function body. (Or even allow decorations anywhere in the function body; if you put them right after the thing they decorate, you can take advantage of delta coding of the target ID. Just an idea.)

@johnkslang
Member

Good stuff, thanks.

I also want to put in a plug for another dimension to generate less SPIR-V to begin with.

SPIR-V was designed with two big features for this:

  1. specialization constants
  2. lots of shaders in the same SPIR-V module

For the first, if multiple GLSL shaders are being generated with different values of constants (e.g., fixed sets of elements to process, or a bool turning features on/off), it is possible to instead make one GLSL shader with specialization constants and wait until app run time to provide the actual constant values needed:

shader A:

const vec4 elements[4];
const bool feature = false;
... for (i = 0; i < 4; ++i) ...elements[i]...
... if (feature) ...

shader B:

const vec4 elements[8];
const bool feature = true;
... for (i = 0; i < 8; ++i) ...elements[i]...
... if (feature) ...

Single shader with specialization:

layout(constant_id=1) const int numElements = 4; // can be changed to 8 at run time
const vec4 elements[numElements];
layout(constant_id=2) const bool feature = false;  // can be changed to true at run time
... for (i = 0; i < numElements; ++i) ...elements[i]...
... if (feature) ...

This turns multiple shaders into a single shader, long before compression even comes into play.
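For completeness, a minimal sketch of the Vulkan side that supplies the values at pipeline-creation time (the SpecData struct and its packing are assumptions for illustration, matching constant_id 1 and 2 above):

    #include <cstddef>
    #include <vulkan/vulkan.h>

    struct SpecData {
      int32_t numElements;  // constant_id = 1
      VkBool32 feature;     // constant_id = 2 (bool spec constants are 32-bit)
    };

    const SpecData specData = {8, VK_TRUE};  // turns "shader A" into "shader B"

    const VkSpecializationMapEntry entries[] = {
        {1, offsetof(SpecData, numElements), sizeof(int32_t)},
        {2, offsetof(SpecData, feature), sizeof(VkBool32)},
    };

    const VkSpecializationInfo specInfo = {
        2,                 // mapEntryCount
        entries,           // pMapEntries
        sizeof(specData),  // dataSize
        &specData,         // pData
    };
    // Passed via VkPipelineShaderStageCreateInfo::pSpecializationInfo.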

For the second point, @dneto0 already touched on it with:

Link shaders together into a single SPIR-V module, to share common declarations (like types), and share helper function bodies.

It's possible that enough ID remapping and cross-file compression would recognize the commonality (it would be good to find out how much of that is happening), but if not, two other approaches would help:

  • higher-level tooling that translates a set of GLSL shaders into a single SPIR-V module from the start
  • specifically look for the common parts across multiple SPIR-V modules, and turn the set of modules into a single module that no longer duplicates the common parts (an open question is how much better that is than compression methods at finding commonality)

@benvanik
Contributor

As a place to look for inspiration, the WebAssembly group may have some relevant ideas. They've iterated heavily on efficient instruction encodings that are easy to parse and compress well. They ended up with LEB128 varint encoding for most things, as well as some ways of shortening long instruction encodings (like the 9-word swizzle mentioned above). Some docs here, but if anyone's interested they could ping the group and chat -- I'm sure they'd be willing to share what they learned along the way :)

@johnkslang
Member

From @aras-p:

...several approaches here: Given a SPIR-V program:

  • Make it more compressible, while still keeping it a valid SPIR-V that does the same thing. This is what spirv-remap does.

This is an important design constraint to be aware of. The question is whether anything that is not "off the shelf" is needed on the target (end-user) system. This applies to both decompression and denormalization.

Also key is whether multiple files are seen just at compression time, or earlier at normalization/remapping time.

The fuller taxonomy is more like:

  • Can it do normalization/remapping/etc. per SPIR-V module, or across SPIR-V modules?
  • Does the normalization require denormalizing at the other end, or did it result in valid SPIR-V?
  • Is the compression scheme itself off-the-shelf, or custom?

All combinations make sense. The remapper was indeed intentionally targeting the combination of

  • normalize per module, not across modules
  • no denormalizer needed on the other end
  • off-the-shelf compression

These are all constraints, and certainly lifting any of them would enable a tool to perform better.

So, I'm curious to what extent the gains were from lifting those constraints, and to what extent from finding more ways of doing better normalization.

@aras-p
aras-p commented Sep 3, 2016

I wrote up what I did so far here: http://aras-p.info/blog/2016/09/01/SPIR-V-Compression/

And indeed, the combination I chose is somewhat different from the remapper. I did this:

  • normalize per module, not across modules (this is the same as remap)
  • denormalizer needed on the other end (this is different!)
  • off-the-shelf compression (this is the same)

Now, my "normalization/denormalization" step also makes the module smaller, so you could view it as a form of compression too. But it's not dictionary/entropy compression, so you can still compress the result afterwards with regular off-the-shelf compressors.

@johnkslang
Member

Nice write-up, thanks.
