Streaming compression and decompression corruption #112

@GUI

Description

In trying to use Zstd::StreamingCompress.new, I seem to have encountered a couple of different data corruption issues. There are two distinct failure modes, which show up across the three usage patterns below:

  1. If you use Zstd::StreamingCompress.new and pass in a string of roughly 16,000 to 131,072 bytes with << or .write, the resulting compressed string will sometimes (seemingly at random) decompress via Zstd.decompress to output that differs from the input. Decompression doesn't fail, but the output is corrupt. However, the zstd CLI tool decompresses the same compressed string successfully, and its output matches the input exactly. So the compressed data itself doesn't necessarily seem corrupt; rather, Zstd.decompress seems unable to handle it.
  2. If you use Zstd::StreamingCompress.new and pass in a string of 131,072 bytes or more with << or .write, Zstd::StreamingCompress consistently produces corrupt content that cannot be decompressed (by either the gem or the CLI tool).
  3. If you use Zstd::StreamingCompress.new and pass in data with .compress (instead of << or .write), only the first issue above occurs, not the second. Data longer than ~16,000 bytes passed to .compress will randomly produce output that Zstd.decompress decompresses incorrectly but that the zstd CLI decompresses correctly. Inputs of 131,072 bytes or more do not fail outright as in the second issue; they continue to exhibit the first issue (mismatched Zstd.decompress output).

I cannot reproduce these issues if I use Zstd.compress, so this seems specific to the streaming compression.
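
Because the first issue is intermittent, repeating the round trip for a single input size can make it easier to see how often it occurs. The following is a minimal sketch, not part of zstd-ruby: roundtrip_tally is a hypothetical helper, and the codecs are passed in as lambdas so the sketch runs without the gem installed. The commented-out lambdas at the end only use calls that the reproduction script below also uses.

```ruby
# Hypothetical helper (not part of zstd-ruby): repeats a compress/decompress
# round trip on the same input and tallies how each attempt ended.
def roundtrip_tally(data, trials:, compress:, decompress:)
  tally = { ok: 0, mismatch: 0, error: 0 }
  trials.times do
    begin
      result = decompress.call(compress.call(data))
      tally[result == data ? :ok : :mismatch] += 1
    rescue StandardError
      # Counts any failure raised by either step.
      tally[:error] += 1
    end
  end
  tally
end

# Stand-in codec so the sketch runs on its own; it simply copies the string.
identity = ->(s) { s.dup }
p roundtrip_tally("a" * 20_000, trials: 5, compress: identity, decompress: identity)

# With zstd-ruby loaded, the stand-ins could be swapped for:
#   stream_write = ->(s) { stream = Zstd::StreamingCompress.new; stream << s; stream.finish }
#   gem_decompress = ->(s) { Zstd.decompress(s) }
#   p roundtrip_tally("a" * 20_000, trials: 100, compress: stream_write, decompress: gem_decompress)
```

A non-zero mismatch count with an error count of zero would correspond to the first issue; errors on every trial would correspond to the second.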

Reproduction script

I've reproduced this with both zstd-ruby 1.5.7.0 and 2.0.0.pre.preview1 on the following platforms:

ruby 3.4.5 (2025-07-16 revision 20cda200d3) +PRISM [arm64-darwin24]

ruby 3.4.5 (2025-07-16 revision 20cda200d3) +PRISM [x86_64-linux]

Here's my attempt at a script to reproduce this; save the following as test_zstd.rb. Sorry, it's a bit convoluted because it tests all 3 situations, but usage notes and some abbreviated output follow below:

require "bundler/inline"
require "digest"
require "tempfile"

gemfile do
  source "https://rubygems.org"
  gem "zstd-ruby", "1.5.7.0"
end

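# Decompress with Zstd.decompress; on a mismatch or error, print sizes and
# truncated SHA-256 checksums, then retry with the zstd CLI for comparison.
def compare_compressed(original:, compressed:)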
  begin
    decompressed = Zstd.decompress(compressed)
  rescue => e
    decompress_error = e
  end

  if original != decompressed
    if decompress_error
      puts "Decompression error for #{original.bytesize} bytes input (#{decompress_error})"
    else
      puts "Content mismatch for #{original.bytesize} bytes input"
    end

    puts "  Original:        #{original.bytesize} bytes, #{Digest::SHA256.hexdigest(original)[0, 10]} checksum"

    if decompressed
      puts "  Zstd.decompress: #{decompressed.bytesize} bytes, #{Digest::SHA256.hexdigest(decompressed)[0, 10]} checksum"
    end

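    # Cross-check: write the compressed bytes to a temp file and decompress
    # them with the zstd CLI instead of the gem.
    begin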
      cli_decompressed = Tempfile.create(binmode: true) do |temp_write|
        temp_write.write(compressed)
        temp_write.close
        Tempfile.create(binmode: true) do |temp_read|
          system "zstd", "--decompress", "--quiet", "--force", "-o", temp_read.path, temp_write.path, exception: true
          File.read(temp_read.path, binmode: true)
        end
      end

      puts "  zstd cli:        #{cli_decompressed.bytesize} bytes, #{Digest::SHA256.hexdigest(cli_decompressed)[0, 10]} checksum"
    rescue => e
      puts "  zstd cli error: #{e}"
    end
  end
end

def test_stream_write
  (1..256_000).each do |length|
    original = "a" * length

    stream = Zstd::StreamingCompress.new
    stream << original
    res = stream.finish

    compare_compressed(original: original, compressed: res)
  end
end

def test_stream_compress
  (1..256_000).each do |length|
    original = "a" * length

    stream = Zstd::StreamingCompress.new
    res = stream.compress(original)
    res << stream.finish

    compare_compressed(original: original, compressed: res)
  end
end

def test_compress
  (1..256_000).each do |length|
    original = "a" * length

    res = Zstd.compress(original)

    compare_compressed(original: original, compressed: res)
  end
end

case ARGV[0]
when "stream_write"
  puts "=== Zstd::StreamingCompress.new with << ==="
  test_stream_write
when "stream_compress"
  puts "=== Zstd::StreamingCompress.new with .compress ==="
  test_stream_compress
when "compress"
  puts "=== Zstd.compress ==="
  test_compress
else
  abort "Unknown test mode: #{ARGV[0].inspect}"
end

Reproduction script usage

  • Run ruby test_zstd.rb stream_write to test Zstd::StreamingCompress.new with <<. This should exhibit the first issue above randomly for input sizes in the ~16,000-131,072 byte range, and the second issue consistently for inputs of 131,072 bytes or more.
  • Run ruby test_zstd.rb stream_compress to test Zstd::StreamingCompress.new with .compress. This should exhibit the third situation described above, with input sizes greater than ~16,000 bytes randomly having issues.
  • Run ruby test_zstd.rb compress to test Zstd.compress, which generates no errors for me.
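
The full 1..256_000 loop can take a while. One way to get a quicker check of the second issue is to probe only lengths around power-of-two boundaries, since 131,072 is 2**17. This is a minimal sketch under that assumption: boundary_lengths is a hypothetical helper, not part of the script or of zstd-ruby, and because the first issue shows up at seemingly random lengths, this narrowing may miss it entirely.

```ruby
# Hypothetical helper: returns the lengths just below, at, and just above
# each power of two from 2**min_exp to 2**max_exp (inclusive).
def boundary_lengths(min_exp:, max_exp:, radius: 1)
  (min_exp..max_exp).flat_map do |exp|
    center = 2**exp
    ((center - radius)..(center + radius)).to_a
  end
end

# 2**14 = 16,384 and 2**17 = 131,072 bracket the ranges described above.
lengths = boundary_lengths(min_exp: 14, max_exp: 17)
puts lengths.inspect
```

The resulting array could stand in for (1..256_000) in test_stream_write or test_stream_compress when only the threshold behavior is of interest.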

Reproduction script example output

  • For ruby test_zstd.rb stream_write, note that the zstd CLI decompresses the Ruby streaming-compressed data to exactly the original input, even when Zstd.decompress does not (this is why the output includes content checksums). However, once the input reaches 131,072 bytes, all decompression fails completely.

    === Zstd::StreamingCompress.new with << ===
    Content mismatch for 16937 bytes input
      Original:        16937 bytes, e31be8f076 checksum
      Zstd.decompress: 16937 bytes, 41fb77f572 checksum
      zstd cli:        16937 bytes, e31be8f076 checksum
    Content mismatch for 21515 bytes input
      Original:        21515 bytes, 1cccd688d3 checksum
      Zstd.decompress: 21515 bytes, 6695186c17 checksum
      zstd cli:        21515 bytes, 1cccd688d3 checksum
    [...]
    Content mismatch for 97075 bytes input
      Original:        97075 bytes, 1f642ed1a7 checksum
      Zstd.decompress: 97075 bytes, acc3eb205e checksum
      zstd cli:        97075 bytes, 1f642ed1a7 checksum
    Decompression error for 131072 bytes input (not compressed by zstd: Unspecified error code)
      Original:        131072 bytes, b44ffb72fc checksum
    zstd: /var/folders/td/52lw67lj0wz36_rhqflz24_19mm_gh/T/20250813-83789-e4qsiy: unknown header 
      zstd cli error: Command failed with exit 1: zstd
    Decompression error for 131073 bytes input (not compressed by zstd: Unspecified error code)
      Original:        131073 bytes, 7e009ea4ef checksum
    zstd: /var/folders/td/52lw67lj0wz36_rhqflz24_19mm_gh/T/20250813-83789-uzf2ug: unsupported format 
      zstd cli error: Command failed with exit 1: zstd
    [...]
    
  • For ruby test_zstd.rb stream_compress, note that it still exhibits the first issue, but inputs above 131,072 bytes behave the same way instead of failing outright:

    === Zstd::StreamingCompress.new with .compress ===
    Content mismatch for 16671 bytes input
      Original:        16671 bytes, 75fa71ee56 checksum
      Zstd.decompress: 16671 bytes, fefcb91c80 checksum
      zstd cli:        16671 bytes, 75fa71ee56 checksum
    Content mismatch for 16936 bytes input
      Original:        16936 bytes, a9d4d5bb65 checksum
      Zstd.decompress: 16936 bytes, adfd120dfd checksum
      zstd cli:        16936 bytes, a9d4d5bb65 checksum
    [...]
    Content mismatch for 98731 bytes input
      Original:        98731 bytes, 093614bb66 checksum
      Zstd.decompress: 98731 bytes, fa9fe924fd checksum
      zstd cli:        98731 bytes, 093614bb66 checksum
    Content mismatch for 131185 bytes input
      Original:        131185 bytes, b5b6d4d116 checksum
      Zstd.decompress: 131185 bytes, 16ab74a052 checksum
      zstd cli:        131185 bytes, b5b6d4d116 checksum
    [...]
    Content mismatch for 244541 bytes input
      Original:        244541 bytes, 951b4d7ef8 checksum
      Zstd.decompress: 244541 bytes, 170cce21a4 checksum
      zstd cli:        244541 bytes, 951b4d7ef8 checksum
    Content mismatch for 247016 bytes input
      Original:        247016 bytes, 2b51d7363f checksum
      Zstd.decompress: 247016 bytes, c5bc5222b8 checksum
      zstd cli:        247016 bytes, 2b51d7363f checksum
    
  • For ruby test_zstd.rb compress (not using the streaming compressor), everything seems to work and the tests report no mismatches:

    === Zstd.compress ===
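
The listings above report equal byte counts but different checksums for the mismatched cases, which means the decompressed data has the right length but the wrong content. As a small illustration of how the truncated checksum exposes that, here is a sketch using the same expression as the script (the first 10 hex characters of the SHA-256 digest). The short_checksum name and the single-byte change are illustrative assumptions; the actual corruption pattern is not known.

```ruby
require "digest"

# Same expression the reproduction script uses for its "checksum" column.
def short_checksum(data)
  Digest::SHA256.hexdigest(data)[0, 10]
end

original = "a" * 16_937
corrupted = original.dup
corrupted.setbyte(100, "b".ord) # change one byte; the length stays the same

puts "Original:  #{original.bytesize} bytes, #{short_checksum(original)} checksum"
puts "Corrupted: #{corrupted.bytesize} bytes, #{short_checksum(corrupted)} checksum"
puts "Same size: #{original.bytesize == corrupted.bytesize}"
puts "Same data: #{original == corrupted}"
```

A byte-size comparison alone would not flag this kind of corruption, which is why the script compares content and prints a digest.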
    
