logpack

Columnar compression for JSON logs. 30-50% smaller than zstd, 20x faster.

cargo install logpack
logpack compress app.jsonl -o app.lpk --chunk-size 50000
logpack decompress app.lpk -o app.jsonl
logpack decompress app.lpk --max-frame-bytes $((1024*1024*1024)) --max-column-bytes $((512*1024*1024))

Why

Generic compressors treat logs as opaque bytes. logpack understands the structure: it extracts schemas, splits values into typed columns, detects log message templates, and encodes each data type optimally. The result is significantly better compression on structured JSON logs — the format used by every modern logging system.
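
A toy illustration of the "extract schemas, split into typed columns" idea (this is not logpack's internal code; it only assumes the serde_json crate is available): parse two JSONL lines and regroup the values by key, so each key's values land in their own column.

use std::collections::BTreeMap;
use serde_json::Value;

fn main() -> serde_json::Result<()> {
    // Two log lines with the same keys: the key set is the schema, stored once.
    let lines = [
        r#"{"ts":1700000000000,"level":"info","msg":"Request from 10.0.1.5 took 42ms"}"#,
        r#"{"ts":1700000000105,"level":"info","msg":"Request from 10.0.1.9 took 37ms"}"#,
    ];
    // Regroup values by key: each key's values form a column (all timestamps
    // together, all levels together), which compresses far better than the
    // interleaved original.
    let mut columns: BTreeMap<String, Vec<Value>> = BTreeMap::new();
    for line in lines {
        let obj: BTreeMap<String, Value> = serde_json::from_str(line)?;
        for (key, value) in obj {
            columns.entry(key).or_default().push(value);
        }
    }
    for (key, values) in &columns {
        println!("{key}: {values:?}");
    }
    Ok(())
}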

Benchmarks

All benchmarks are lossless roundtrips. Speed: 35-45 MB/s (vs zstd -19 at ~2 MB/s).

Structured JSON Logs

| Dataset | Rows | Input | logpack | zstd -19 | vs zstd -19 |
|---|---|---|---|---|---|
| Webserver logs | 500K | 128 MB | 47.4x | 22.1x | 53% smaller |
| Mixed service logs | 100K | 12 MB | 30.1x | 16.8x | 44% smaller |
| Logs with UUIDs | 500K | 158 MB | 24.2x | 16.8x | 30% smaller |

Real-World Logs (LogHub)

| Dataset | Rows | logpack | zstd -19 | vs zstd -19 |
|---|---|---|---|---|
| OpenStack | 100K | 10.4x | 9.9x | 5% smaller |
| Spark | 2K | 32.0x | 29.7x | 6% smaller |
| HPC | 2K | 16.6x | 15.7x | 5% smaller |
| Zookeeper | 2K | 22.8x | 22.0x | 3% smaller |
| BGL | 2K | 11.5x | 11.4x | 1% smaller |
| HDFS | 100K | 9.5x | 10.3x | 8% larger |
| Linux syslog | 100K | 8.8x | 11.7x | 33% larger |

logpack wins on structured data. On text-heavy logs with high-cardinality fields (60K unique PIDs, non-standard timestamps), zstd's sliding window has an edge.

Library Usage

use logpack::{compress, decompress, decompress_with_options, CompressOptions, DecompressOptions};
use std::fs::File;
use std::io::BufWriter;

// Compress app.jsonl into app.lpk, chunking the input into 50,000-row frames.
let input = File::open("app.jsonl")?;
let output = BufWriter::new(File::create("app.lpk")?);
let compress_opts = CompressOptions { chunk_size: 50_000, ..Default::default() };
compress(input, output, &compress_opts)?;

// Decompress with the default safety limits.
let input = File::open("app.lpk")?;
let output = BufWriter::new(File::create("app.jsonl")?);
decompress(input, output)?;

// Decompress a trusted file with raised frame and column limits.
let input = File::open("app.lpk")?;
let output = BufWriter::new(File::create("trusted.jsonl")?);
let mut decompress_opts = DecompressOptions::default();
decompress_opts.max_frame_bytes = 1024 * 1024 * 1024;
decompress_opts.max_column_bytes = 512 * 1024 * 1024;
decompress_with_options(input, output, &decompress_opts)?;

How It Works

  1. Schema extraction -- JSON keys repeat on every line. Store the structure once.
  2. Columnar split -- Group values by field. All timestamps together, all log levels together.
  3. Template extraction -- Log messages like "Request from 10.0.1.5 took 42ms" decompose into a template ("Request from <IP> took <N>ms") plus typed variables (IP as 4 bytes, integer as varint); a sketch of this decomposition follows the list.
  4. Type-aware encoding -- 16 encoding strategies selected per column:
    • Timestamps: delta-of-delta with frame-of-reference bit-packing (also sketched after the list)
    • Categoricals: dictionary, RLE, or bit-packed depending on cardinality
    • Integers: delta encoding or dictionary for low-cardinality
    • IPs: 4-byte binary. UUIDs: 16-byte binary. Hex strings: nibble-packed.
    • Booleans: 1 bit each
  5. Adaptive zstd -- Each column compressed independently with tuned zstd levels.
  6. Streaming -- Processes input in 100K-row chunks. Constant memory regardless of file size.
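
The template extraction in step 3 can be pictured with a small standalone sketch. This is not logpack's detector (the real classifier, placeholder names, and variable encodings are internal to the crate); it only shows the idea of splitting a message into a reusable template plus typed variables, using the standard library.

fn extract_template(msg: &str) -> (String, Vec<String>) {
    let mut template = Vec::new();
    let mut vars = Vec::new();
    for token in msg.split_whitespace() {
        // Split off a leading run of digits so "42ms" is seen as number + unit.
        let digits_end = token.find(|c: char| !c.is_ascii_digit()).unwrap_or(token.len());
        let (num, rest) = token.split_at(digits_end);
        if token.split('.').count() == 4 && token.split('.').all(|p| p.parse::<u8>().is_ok()) {
            template.push("<IP>".to_string());
            vars.push(token.to_string()); // an IPv4 address packs into 4 bytes
        } else if !num.is_empty() && rest.chars().all(|c| c.is_ascii_alphabetic()) {
            template.push(format!("<N>{rest}"));
            vars.push(num.to_string()); // an integer packs into a varint
        } else {
            template.push(token.to_string()); // literal token stays in the template
        }
    }
    (template.join(" "), vars)
}

fn main() {
    let (template, vars) = extract_template("Request from 10.0.1.5 took 42ms");
    assert_eq!(template, "Request from <IP> took <N>ms");
    assert_eq!(vars, vec!["10.0.1.5", "42"]);
}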
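
Step 4's timestamp strategy (delta-of-delta with a frame-of-reference value) can also be sketched in a few lines. The real encoder's zigzag and bit-packing details and its block layout are not shown here; this only illustrates why near-regular timestamps shrink to a few bits per value.

fn delta_of_delta(ts: &[i64]) -> Vec<i64> {
    if ts.is_empty() {
        return Vec::new();
    }
    let mut out = Vec::with_capacity(ts.len());
    let (mut prev, mut prev_delta) = (ts[0], 0i64);
    out.push(ts[0]); // the first value is the frame of reference, stored as-is
    for &t in &ts[1..] {
        let delta = t - prev;
        out.push(delta - prev_delta); // usually zero or tiny for steady log traffic
        prev = t;
        prev_delta = delta;
    }
    out
}

fn main() {
    let ts = [1_700_000_000_000i64, 1_700_000_000_100, 1_700_000_000_205, 1_700_000_000_310];
    // Prints [1700000000000, 100, 5, 0]: after the reference, every value is
    // small enough to bit-pack into a handful of bits.
    println!("{:?}", delta_of_delta(&ts));
}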

Safety Limits

logpack rejects suspicious or excessively large frames during decompression instead of allowing unbounded memory growth.

  • Default frame limit: 512 MiB decompressed across one frame
  • Default column limit: 256 MiB decompressed for one column payload
  • Default frame row limit: 1,000,000 rows

For trusted files, you can raise the limits from the CLI (the library equivalent is decompress_with_options with a custom DecompressOptions, shown under Library Usage):

logpack decompress app.lpk --max-frame-bytes $((1024*1024*1024)) --max-column-bytes $((512*1024*1024))

To avoid hitting these limits on newly written files, reduce the chunk size at compression time, which bounds how large each frame can grow:

logpack compress app.jsonl -o app.lpk --chunk-size 50000

Format

.lpk files use a chunked frame format:

LPAK + version byte
[frame]*
  row_count | zstd_dictionary | schema_registry | schema_ids
  per column: name | encoding | null_bitmap | compressed_data
end marker (row_count = 0)

Version 4 frames also store exact uncompressed sizes for compressed sections, which lets the decoder validate expected output sizes precisely.
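
A minimal header check against the layout above, assuming the magic is the four ASCII bytes LPAK followed by a one-byte version (the frame contents beyond that are not parsed here):

use std::fs::File;
use std::io::Read;

fn read_header(path: &str) -> std::io::Result<u8> {
    let mut file = File::open(path)?;
    let mut magic = [0u8; 4];
    file.read_exact(&mut magic)?;
    // Assumed magic: the layout above lists "LPAK + version byte".
    assert_eq!(&magic, b"LPAK", "not a logpack file");
    let mut version = [0u8; 1];
    file.read_exact(&mut version)?;
    Ok(version[0])
}

fn main() -> std::io::Result<()> {
    // Version 4 frames also carry exact uncompressed sizes, as noted above.
    println!("format version {}", read_header("app.lpk")?);
    Ok(())
}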

Lossless roundtrip. Key order may differ (JSON objects are unordered by spec).

Supported Input

  • JSONL (newline-delimited JSON), one object per line
  • Mixed schemas (different objects can have different keys)
  • UTF-8
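
For example, both of these lines are valid in one input file even though their key sets differ (the values are made up for illustration):

{"ts":1700000000000,"level":"info","msg":"service started"}
{"ts":1700000000250,"level":"error","msg":"upstream timeout","request_id":"a1b2c3","status":504}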

License

MIT
