GPU-Accelerated Lossless Compression — High Ratio at High Speed
Pre-built binaries, benchmark datasets, and verification tools for independent validation and demonstration. Source code is not included — APEX is in active development.
APEX achieves high compression ratios at high throughput — a combination that has traditionally required choosing one or the other. GPU-accelerated for maximum performance, with a dedicated CPU-only binary for systems without NVIDIA GPUs.
| Data Type | Ratio | Compress | Decompress | Config |
|---|---|---|---|---|
| Mixed corpus (Silesia 202MB) | 4.00x | 541 MB/s | 672 MB/s | Par 6MB |
| Server logs (Spark 2.8GB) | 28.35x | 1,257 MB/s | 1,545 MB/s | Par 16MB |
| Structured data (JSON 1.1GB) | 18.11x | 1,642 MB/s | 2,022 MB/s | Par 18MB |
| HPC logs (BGL 709MB) | 17.32x | 767 MB/s | 1,102 MB/s | Par 12MB |
| Source code (Linux Kernel 1.5GB) | 9.26x | 817 MB/s | 999 MB/s | Par 12MB |
| Financial tick data (Binance 612MB) | 7.27x | 531 MB/s | 682 MB/s | Par 6MB |
| Analytics export (IMDb 2.6GB TSV) | 5.36x | 583 MB/s | 719 MB/s | Par 6MB |
| Genomic data (Human Genome 3GB) | 4.36x | 479 MB/s | 757 MB/s | Par 8MB |
Numbers above are from a consumer laptop (RTX 5070 Laptop, 8 GB GDDR7, 16 GB RAM) — not a server or workstation. No per-dataset tuning — out-of-the-box performance. RTX 5090 results: up to 1,899 MB/s compress, 4,403 MB/s decompress. Server-class hardware would be expected to improve further. See BENCHMARKS.md for all 21 datasets across 3 systems.
Tested on 3 systems with different GPUs (RTX 5070, 4090, 5090) and CPUs (Zen 2, Zen 4). Ratios are deterministic — identical across all hardware. Speeds scale with GPU compute and CPU core count.
No GPU? APEX still works. CPU-only mode: 826 MB/s on JSON, 253 MB/s on Linux Kernel, 131 MB/s at 4.0x on Silesia — faster, even without a GPU, than every BWT compressor we tested. See CPU-Only Mode.
RAM note: This testing binary reads the full file into memory before processing. bench needs ~3x file size in RAM, compress/decompress need ~1.5x. The compression algorithm itself is block-based and does not require the full file in memory — this is specific to the current testing CLI, not an algorithm constraint. See memory details.
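The RAM rule above is simple multiplication; a minimal sketch (the `ram_needed_gb` helper is ours, for illustration — not part of the APEX CLI):

```python
# Rough RAM estimate for the current testing CLI, per the rule above:
# bench holds ~3x the file size in RAM, compress/decompress ~1.5x.
def ram_needed_gb(file_size_gb: float, command: str) -> float:
    factor = {"bench": 3.0, "compress": 1.5, "decompress": 1.5}[command]
    return file_size_gb * factor

# A 3 GB genome needs ~9 GB for bench, ~4.5 GB for compress:
print(ram_needed_gb(3.0, "bench"))      # 9.0
print(ram_needed_gb(3.0, "compress"))   # 4.5
```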
vs zstd on server logs: On enterprise log data (Spark, HDFS, BGL), APEX achieves 21-34% better ratio than zstd at any level while matching or exceeding zstd's compress speed. Tested across zstd levels 9/12/15 with 6-14 threads. No zstd configuration reaches APEX's ratio — BWT captures log template repetition that LZ77 cannot. Full comparison in BENCHMARKS.md.
Full results, 3-system comparison, CPU-only benchmarks, and vs-competition in BENCHMARKS.md.
APEX is in active development. This binary is shared for community validation — verify the claims on your own hardware.
Don't want to run an unknown binary on your machine? Completely understandable — running closed-source binaries from the internet requires trust, and we haven't earned that yet. The included `verify.sh` validates everything using standard Unix tools (`md5sum`, `stat`, `cmp`) without trusting APEX's own output, but that still means running the binary. If you're genuinely interested in testing but don't want to run it on your hardware, I'll provision a cloud GPU instance for you on Vast.ai (or any provider you prefer) at my expense — you launch it, test freely, and tear it down when done. I can also add you to my Vast.ai team so you can provision any instance yourself. From clone to full benchmark results takes ~10 minutes: Quick setup guide. Reach out at ritik135001@gmail.com.
| System | GPU | CPU | Best Compress | Best Decompress |
|---|---|---|---|---|
| Dev machine (laptop) | RTX 5070 Laptop | Ryzen 9 8940HX (Zen 4) | 1,642 MB/s | 2,022 MB/s |
| Vast.ai cloud | RTX 4090 | EPYC 7D12 (Zen 2) | 1,793 MB/s | 2,324 MB/s |
| Vast.ai cloud | RTX 5090 | Dual EPYC 7742 (Zen 2) | 1,899 MB/s | 4,403 MB/s |
Ratios match exactly across all 3 systems. All round-trip verified PASS. Full results in BENCHMARKS.md.
Note: Speed depends on both GPU and CPU. See CPU Architecture Effects for details on how Zen 2 vs Zen 4, AVX2 vs AVX-512, and clock speed affect results.
This repo serves two purposes — showcasing what APEX can do and letting you verify it independently.
We've published results showing high ratio at high speed — a combination no existing compressor achieves. Rather than asking you to take our word for it, this binary lets you:
- Validate — Run the same benchmarks on your hardware and verify the claims
- Reproduce — Download the exact datasets we used and reproduce our methodology
- Compare — Test against zstd, bzip2, libbsc, or any compressor on the same data
- Verify correctness — Every compressed file round-trips to a byte-identical original (PASS/FAIL)
- Explore — Test on your own data, find optimal configs for your workload
If you find APEX useful or interesting, reach out at ritik135001@gmail.com.
| File | Size | GPU | CPU | Description |
|---|---|---|---|---|
| `apex` | 16 MB | Yes | AVX2 | Default — GPU + any modern CPU |
| `apex-gpu-avx2` | 16 MB | Yes | AVX2 | Same as `apex` |
| `apex-gpu-avx512` | 16 MB | Yes | AVX-512 | GPU + AVX-512 CPU (Zen 4+, Intel 12th gen+). Faster decompress. |
| `apex-cpu-avx2` | 1.3 MB | No | AVX2 | CPU-only — no CUDA needed. Any CPU from 2013+. |
| `apex-cpu-avx512` | 1.3 MB | No | AVX-512 | CPU-only — no CUDA needed. Zen 4+, Intel 12th gen+. Fastest CPU-only. |
| `apex-cpu-sse42` | 1.1 MB | No | SSE4.2 | CPU-only — oldest CPUs. Sandy Bridge+ (2011+). No AVX needed. |
| `download_datasets.sh` | 15 KB | — | — | Downloads benchmark datasets into `data/` |
| `verify.sh` | 7 KB | — | — | Independent verification using standard Unix tools |
| `sysinfo.sh` | 3 KB | — | — | Prints full system info (CPU, GPU, RAM, CUDA, OS) |
| `BENCHMARKS.md` | — | — | — | Full results: 21 datasets, 3 systems, CPU-only, vs-competition |
| `LICENSE` | — | — | — | Testing license |
For best performance, use the binary that matches your system. The `*-avx512` variants are significantly faster on decompress (+77-122%) and slightly faster on compress (~10-15%). Using the wrong variant won't give wrong results — `apex` (default) always works — but you'll leave performance on the table.
Step 1: Check your system:
```bash
# Do you have an NVIDIA GPU?
nvidia-smi
# Shows GPU info → you have a GPU
# "command not found" → no GPU, use apex-cpu-*

# Does your CPU support AVX-512?
grep -c avx512 /proc/cpuinfo
# Number > 0 → YES, use *-avx512 variant for best speed
# 0 → NO, use *-avx2 variant (or default apex)

# Quick one-liner to tell you which binary to use:
if nvidia-smi &>/dev/null; then
  if grep -q avx512 /proc/cpuinfo; then echo "Use: apex-gpu-avx512";
  else echo "Use: apex (default)"; fi
else
  if grep -q avx512 /proc/cpuinfo; then echo "Use: apex-cpu-avx512";
  elif grep -q avx2 /proc/cpuinfo; then echo "Use: apex-cpu-avx2";
  else echo "Use: apex-cpu-sse42"; fi
fi
```

Step 2: Pick your binary:

```
Do you have an NVIDIA GPU + CUDA?
├─ YES → Does your CPU support AVX-512?
│   ├─ YES (Zen 4+, Intel 12th+) → apex-gpu-avx512 (fastest)
│   └─ NO (older CPU)            → apex (default, always works)
└─ NO → Does your CPU support AVX2?
    ├─ YES + AVX-512 → apex-cpu-avx512 (fastest CPU-only)
    ├─ YES           → apex-cpu-avx2 (any CPU from 2013+)
    └─ NO (very old) → apex-cpu-sse42 (Sandy Bridge+ 2011+)
```
| Your System | Best Binary | What it needs |
|---|---|---|
| NVIDIA GPU + AVX-512 CPU | `apex-gpu-avx512` | CUDA + AVX-512 |
| NVIDIA GPU + older CPU | `apex` (default) | CUDA + AVX2 |
| No GPU + AVX-512 CPU | `apex-cpu-avx512` | Just AVX-512 |
| No GPU + AVX2 CPU (2013+) | `apex-cpu-avx2` | Just AVX2 |
| No GPU + old CPU (2011+) | `apex-cpu-sse42` | Just SSE4.2 |
Note on `apex-cpu-sse42`: This binary is verified to contain zero AVX/AVX2/AVX-512 instructions (confirmed via `objdump`). Round-trip tested and cross-compatible with all other binaries. However, it has not been tested on actual pre-AVX2 hardware (Sandy Bridge/Ivy Bridge). If you have such hardware and test it, please share your results.
Note: Using `apex` (default) on an AVX-512 CPU works fine — correct results, good speed. You just won't get the extra decompress boost that `apex-gpu-avx512` provides. It never crashes; it just doesn't use the wider instructions.
No NVIDIA GPU? No CUDA? No problem. Use `apex-cpu-avx2` (or `apex-cpu-avx512` for Zen 4+):

```bash
chmod +x apex-cpu-avx2
./apex-cpu-avx2 --help    # Shows "GPU: Disabled (CPU-only mode)"
./apex-cpu-avx2 bench data/silesia.tar
./apex-cpu-avx2 compress myfile.tar myfile.apex -mt
./apex-cpu-avx2 decompress myfile.apex restored.tar
```

All binaries produce identical compressed files — same format, same ratios. A file compressed with `apex-cpu-avx2` can be decompressed with `apex-gpu-avx512` and vice versa.
CPU-only APEX is faster than every BWT compressor we tested (bsc, bzip2, bzip3) — and LZMA as well — even without a GPU. See CPU-Only benchmarks.
Source code is not part of this release. APEX is in development and will be released when ready. This binary is provided specifically for the purpose of community validation and independent benchmarking.
New here? Start with Quick Start, then run `./apex tune mydata.tar` — it tests all configurations on YOUR data and recommends the best one.
| # | Section | What You'll Find |
|---|---|---|
| 1 | System Requirements | What hardware and software you need |
| 2 | Setup Checklist | Step-by-step verification before running |
| 3 | Quick Start | From download to first benchmark in 30 seconds |
| 4 | Understanding the Output | What every line of output means |
| 5 | All Commands | Every command and flag with examples |
| 6 | Default Behavior | What happens when you run each mode |
| 7 | Download Datasets | Get the exact datasets we benchmarked |
| 8 | Reproduce Our Benchmarks | Step-by-step to reproduce our published numbers |
| 9 | Hardware Configuration | Adapt APEX to your CPU, GPU, and RAM |
| 10 | Advanced Tuning | Block size selection, preprocessing, flag combinations |
| 11 | Reference Numbers | Our published results (your target to match/beat) |
| 12 | Troubleshooting | Common issues and fixes |
| 13 | Reporting Results | How to share your findings with the community |
All published benchmark numbers were measured on this exact configuration:
| Component | Specification |
|---|---|
| Machine | ASUS TUF Gaming A16 (2025) — laptop |
| CPU | AMD Ryzen 9 8940HX (Zen 4, 16 cores / 32 threads, 1MB L2/core, 64MB L3, 5.4 GHz boost) |
| GPU | NVIDIA RTX 5070 Laptop (Blackwell GB206, 8GB GDDR7, 384 GB/s bandwidth) |
| RAM | 16GB DDR5-5200 single-channel (~40 GB/s) |
| Storage | Samsung 1TB NVMe Gen4 (~6.5 GB/s sequential) |
| OS | Ubuntu 24.04.4 LTS, Kernel 6.17.0 |
| CUDA | 13.2.51, Driver 580.126.09 |
| Power | Plugged in (AC), ASUS Performance mode (max fans), CPU governor: performance |
All benchmarks were run plugged in with Performance thermal profile enabled. Battery mode reduces GPU power from ~115W to ~47W, significantly lowering speeds. If you're on a laptop, make sure you're plugged in and in performance mode for comparable results.
Your results will differ based on your hardware. That's the point — we want to see how APEX performs across different systems.
APEX is benchmarked on 21 real-world datasets across multiple domains — standard benchmarks, enterprise production data, and real-world downloads.
| Category | Datasets | Why it matters |
|---|---|---|
| Standard benchmarks | Silesia (202MB), enwik8 (96MB), enwik9 (954MB) | Industry-standard. Every compressor publishes these. Directly comparable. |
| Source code | Linux Kernel (1.5GB), LLVM (2.4GB) | Real production codebases. Tests scaling on large repetitive data. |
| Server logs | Spark (2.8GB), HDFS (1.5GB), BGL (709MB) | Enterprise log pipelines — the data Datadog/Splunk/Elastic ingest daily. |
| Financial data | Binance BTC (3.7GB), Binance BNB (612MB) | Real exchange tick data. CSV with prices, quantities, timestamps. |
| Analytics/data lake | IMDb TSV (2.6GB), GH Events JSON (480MB), Large JSON (1.1GB), Wiki SQL, CSV | Database exports, API logs, tabular data. |
| Genomics | Human Genome GRCh38 (3.0GB) | Real DNA reference genome. BWT is native to this domain. |
| Incompressible | Firefox (79MB), Taxi Parquet (48MB+659MB) | Already-compressed data. APEX detects and stores RAW at memcpy speed. |
All datasets are publicly downloadable. No synthetic or generated data. The included download_datasets.sh fetches the same files we used — you test on exactly what we tested on.
| Requirement | Minimum | How to Check |
|---|---|---|
| OS | Linux x86-64 (Ubuntu 22.04+, Fedora 38+, Arch) | uname -m should show x86_64 |
| CPU | Any x86-64 with AVX2 (Haswell 2013 onwards) | grep avx2 /proc/cpuinfo |
| RAM | 4 GB (for small files) | free -h |
| Disk | 1 GB free (for binary + one dataset) | df -h . |
| Requirement | Minimum | How to Check |
|---|---|---|
| NVIDIA GPU | Turing or newer (RTX 20xx+, T4+) | nvidia-smi |
| NVIDIA Driver | 525+ | nvidia-smi (top row shows driver version) |
| CUDA Toolkit | 12.0+ | nvcc --version |
| VRAM | 6 GB+ (8 GB recommended) | nvidia-smi (shows memory) |
| Generation | GPUs | Year |
|---|---|---|
| Turing | RTX 2060-2080 Ti, GTX 1660 Ti, T4 | 2018-2019 |
| Ampere | RTX 3060-3090, A100, A10, A30 | 2020-2021 |
| Lovelace | RTX 4060-4090, L40, L40S | 2022-2023 |
| Hopper | H100, H200 | 2023-2024 |
| Blackwell | RTX 5070-5090, B100, B200 | 2024-2025 |
No GPU? APEX still works — it falls back to CPU-only mode automatically. Same compression ratio, just slower speed.
Quickest way: Run the included sysinfo.sh script — it checks everything at once:
```bash
chmod +x sysinfo.sh
./sysinfo.sh
```

This prints your CPU (model, cores, clock, AVX2/AVX-512), GPU (model, VRAM, driver), RAM, CUDA version, OS, and APEX status. Copy-paste the output when sharing benchmark results.
Or check manually, one by one:
```bash
# 1. Check OS
uname -m
# Expected: x86_64

# 2. Check CPU supports AVX2
grep -c avx2 /proc/cpuinfo
# Expected: a number > 0

# 3. Check NVIDIA driver
nvidia-smi
# Expected: shows GPU name, driver version, VRAM
# If "command not found": no NVIDIA driver installed (APEX will use CPU-only mode)

# 4. Check CUDA toolkit
nvcc --version
# Expected: shows CUDA version 12.0+
# If "command not found": install CUDA toolkit (see below)

# 5. Check available RAM
free -h
# Look at "available" column — need at least 2-4 GB free

# 6. Check disk space
df -h .
# Need ~10 GB for datasets + compressed files
```

To install the CUDA toolkit:

```bash
# Ubuntu 22.04/24.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt install -y cuda-toolkit-13-2
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # Should now work

# Fedora:
sudo dnf install cuda-toolkit

# If you don't want to install CUDA:
# Use apex-cpu-avx2 instead — no CUDA needed, full CPU performance
```

With a GPU:

```bash
# Clone the repo
git clone https://github.com/Rkcr7/apex-testing.git
cd apex-testing

# Make binaries executable
chmod +x apex apex-gpu-avx2 apex-gpu-avx512 apex-cpu-avx2 apex-cpu-avx512 apex-cpu-sse42 verify.sh download_datasets.sh sysinfo.sh

# Print your system info (share this with benchmarks)
./sysinfo.sh

# Check it works
./apex --help
# You should see: APEX version, SIMD tier, GPU status, worker count
# If you have AVX-512: use ./apex-gpu-avx512 instead for best performance

# Download benchmark datasets
./download_datasets.sh

# Benchmark Silesia
./apex bench data/silesia.tar

# Benchmark ALL 5 datasets at once (with cooldown, saves to results.txt)
for f in data/silesia.tar data/enwik9 data/realworld/linux-kernel.tar data/realworld/large_json_1gb.json data/realworld/grch38.fna; do echo ""; echo "=== $f ==="; ./apex bench "$f"; sleep 10; done 2>&1 | tee results.txt

# Compress a file
./apex compress myfile.tar myfile.apex -mt

# Decompress
./apex decompress myfile.apex restored.tar

# Verify it's lossless (byte-perfect)
cmp myfile.tar restored.tar && echo "PASS: Files are identical"
```

CPU-only:

```bash
git clone https://github.com/Rkcr7/apex-testing.git
cd apex-testing
chmod +x apex-cpu-avx2 apex-cpu-avx512 apex-cpu-sse42 verify.sh download_datasets.sh sysinfo.sh
./sysinfo.sh                              # Check system
./download_datasets.sh                    # Download datasets
./apex-cpu-avx2 bench data/silesia.tar    # Benchmark (CPU-only)
./apex-cpu-avx2 compress myfile.tar myfile.apex -mt
./apex-cpu-avx2 decompress myfile.apex restored.tar
# If your CPU has AVX-512: use apex-cpu-avx512 instead for best speed
```

On startup, `./apex --help` reports something like:

```
APEX 0.1.0 — GPU-Accelerated Lossless Compression
SIMD: AVX-512 (Tier 1)     ← Your CPU's vector instruction set
GPU: CUDA enabled          ← GPU detected and ready
Workers: 14 threads        ← Auto-detected worker count (your cores - 2)
```
- SIMD Tier 1 (AVX-512): Best. Found on AMD Zen 4+ and Intel Ice Lake+.
- SIMD Tier 2 (AVX2): Good. Any modern CPU.
- GPU: CUDA enabled: GPU acceleration is active (~10x faster than CPU-only).
- GPU: Disabled (CPU-only mode): No CUDA found. Works fine, just slower.
- Workers: Number of parallel threads. Auto-detected as `physical_cores - 2`.
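The auto-detection rule above can be sketched in a few lines (the floor of 1 worker is our assumption for very small CPUs, not confirmed by the docs):

```python
# Worker auto-detection as described above: physical cores minus 2,
# reserving 2 cores for OS/IO. Floor at 1 is our assumption.
def auto_workers(physical_cores: int) -> int:
    return max(1, physical_cores - 2)

print(auto_workers(16))  # 14 — matches "Workers: 14 threads" on a 16-core CPU
```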
```
Compressed: 211957760 -> 52983644 bytes (4.00x ratio)
Speed: 541 MB/s   Time: 373 ms   Threads: 14
```
- 211957760 -> 52983644: Original size → compressed size (in bytes)
- 4.00x ratio: Original / compressed = how much smaller. Higher = better.
- 541 MB/s: Compression throughput (original_size / time). Higher = faster.
- Threads: 14: Worker threads used.
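The figures on that line can be recomputed by hand; a quick check (using 1 MB = 2^20 bytes, which is what the published speeds appear to use):

```python
# Recomputing "211957760 -> 52983644 bytes (4.00x ratio), 541 MB/s, 373 ms":
original, compressed = 211_957_760, 52_983_644
time_ms = 373

ratio = original / compressed                     # original / compressed
speed_mb_s = original / 2**20 / (time_ms / 1000)  # original_size / time

print(f"{ratio:.2f}x")           # 4.00x
print(f"{speed_mb_s:.0f} MB/s")  # ~542 MB/s — within rounding of the published 541
```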
```
Config    Compress   Decomp     Ratio   Verify
------    --------   ------     -----   ------
1T        226 MB/s   613 MB/s   4.02x   PASS
Par 6MB   541 MB/s   672 MB/s   4.00x   PASS
```
- 1T: Single-thread mode (best ratio, slower). Uses 1 GPU transform + parallel encoding.
- Par 6MB: Parallel mode with 6MB blocks. Uses 14 workers + GPU. Fastest.
- Compress/Decomp: Speed in MB/s. These are algorithm speed (excludes file I/O).
- Ratio: Compression ratio. Higher = better.
- PASS: Round-trip verified (compress → decompress → byte-compare = identical).
```
./apex compress <input> <output.apex> [flags]
```

| Flag | What It Does | When to Use |
|---|---|---|
| (no flags) | 1T mode: single-thread + GPU, largest blocks | Best ratio. Archival. |
| `-mt` | Parallel mode: auto threads + GPU, auto block size | Best speed. Use this most of the time. |
| `--par N` | Parallel with N MB blocks (6, 8, 12, 14, 16, 18, 20) | Specific tuning after running `tune`. |
| `-t N` | Use exactly N worker threads | Control CPU usage (e.g., `-t 4` on a shared server). |
| `--no-lzp` | Skip preprocessing | +65% compress speed, -0.5% ratio. Speed-critical. |
| `-v` | Verbose output | See GPU status, pipeline details. |

Examples:

```bash
./apex compress data.tar data.apex              # 1T mode (best ratio)
./apex compress data.tar data.apex -mt          # Parallel (best speed)
./apex compress data.tar data.apex --par 14     # 14MB blocks (for source code)
./apex compress data.tar data.apex -mt -t 8     # Parallel, 8 threads
./apex compress data.tar data.apex -mt --no-lzp # Parallel, skip preprocessing (fastest)
./apex compress data.tar data.apex -v           # Verbose (see pipeline info)
```

```
./apex decompress <input.apex> <output> [-v]
```

Decompression auto-detects everything from the `.apex` file header. No flags needed.

```bash
./apex decompress data.apex restored.tar        # Decompress
./apex decompress data.apex restored.tar -v     # Verbose
```

```
./apex tune <input> [-t N]
```

Tests all configs (1T + 7 parallel block sizes), measures speed and ratio, and recommends the best. Run this before benchmarking.

```bash
./apex tune mydata.tar          # Auto threads
./apex tune mydata.tar -t 8     # Test with 8 threads
```

Output: a table of all configs with speeds and ratio, then specific recommendations for fastest compress, best ratio, fastest decompress, and best overall.

```
./apex bench <input> [-v]
```

Tests 8 configurations automatically: 1T + Par 6/8/12/14/16/18/20 MB. For each config:

- Warmup run (initializes GPU, excluded from timing)
- Compress best-of-2 (data pre-loaded in RAM, measures algorithm speed only)
- Decompress best-of-2
- Round-trip verify (`memcmp` original vs decompressed)

Speed measurement: data is pre-loaded in RAM before timing starts. The timer wraps only the compression/decompression call — no file I/O, no memory allocation. This is the same methodology used by lzbench and other standard benchmark frameworks. The speed you see is pure algorithm throughput.

```bash
./apex bench data/silesia.tar       # Standard benchmark
./apex bench data/silesia.tar -v    # With methodology notes
```

Example output:

```
Config     Compress   Decomp     Ratio   Verify
------     --------   ------     -----   ------
1T         226 MB/s   613 MB/s   4.02x   PASS
Par 6MB    541 MB/s   672 MB/s   4.00x   PASS
Par 8MB    524 MB/s   654 MB/s   4.01x   PASS
Par 12MB   413 MB/s   621 MB/s   4.04x   PASS
...
```

To see wall-clock speed (including file I/O), use `time ./apex compress ...` or the `verify.sh` script, which shows both.

```
./apex info <output.apex>
```

Shows block count, compression ratio, format version, original size.
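The warmup + best-of-2, in-RAM timing used by `bench` can be sketched generically. Here `zlib` stands in for APEX — this illustrates the methodology, not APEX's actual code:

```python
# lzbench-style timing: data pre-loaded in RAM, one warmup run excluded,
# then best-of-2 with the timer wrapping only the codec call.
import time
import zlib

def best_of(n, fn, *args):
    best = float("inf")
    for _ in range(n):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

data = b"the quick brown fox " * 100_000   # pre-loaded in RAM, no file I/O
zlib.compress(data)                        # warmup, excluded from timing

t = best_of(2, zlib.compress, data)
print(f"{len(data) / 2**20 / t:.0f} MB/s") # pure algorithm throughput
```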
- APEX detects content type (text? binary? JSON? already compressed?)
- Runs in 1T mode: single-thread compression using GPU acceleration
- Creates 1-2 large blocks for maximum context
- Applies preprocessing (removes long-range repeated sequences)
- GPU-accelerated transform (groups similar contexts together)
- Entropy encoding (near-optimal bit-level coding)
- Output is a single `.apex` file with headers, compressed blocks, and checksums
This gives the BEST ratio but is slower than parallel mode.
- Same pipeline, but splits input into N blocks of 6MB each
- 14 worker threads (auto-detected) process blocks in parallel
- Workers share the GPU — while 1 uses GPU, others do CPU encoding in parallel
- A collector thread writes blocks in order
- This gives the BEST speed — typically 2-5x faster than 1T.
`--par 14` behaves the same as `-mt` but with 14MB blocks instead of the auto-selected 6MB. Larger blocks = better ratio, slightly fewer blocks for pipeline overlap.
`--no-lzp` skips the preprocessing step. The preprocessor scans for repeated 40+ byte sequences and removes them before the transform. Skipping it makes compression ~65% faster but loses ~0.5% ratio. Worth it if speed matters more than that last 0.5%.
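To make "repeated 40+ byte sequences" concrete, here is a toy sketch of what a long-match preprocessor looks for. This is NOT APEX's actual preprocessor — just the idea of flagging fixed-size windows that recur:

```python
# Toy long-repeat detector: hash 40-byte windows, flag ones seen before.
# A real preprocessor would also encode the repeats away; we only count them.
MIN_MATCH = 40

def count_long_repeats(data: bytes, step: int = MIN_MATCH) -> int:
    seen, repeats = set(), 0
    for i in range(0, len(data) - MIN_MATCH + 1, step):
        window = data[i:i + MIN_MATCH]
        if window in seen:
            repeats += 1          # 40-byte window seen before: removable repeat
        else:
            seen.add(window)
    return repeats

log_like = b"INFO executor 12 finished task in 305 ms on host node-7\n" * 1000
print(count_long_repeats(log_like))   # template-heavy logs -> lots of repeats
```

On repetitive data (logs, JSON) nearly every window recurs, which is why the pass pays off there; on random-looking binary almost none do, and it is pure overhead.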
The included script downloads the same public datasets used in our benchmarks.
```bash
chmod +x download_datasets.sh

# Essential 5 datasets (~8.4 GB on disk):
#   Silesia (202 MB)      — universal mixed benchmark
#   enwik9 (954 MB)       — Wikipedia text
#   Linux Kernel (1.5 GB) — source code tarball
#   Large JSON (1.1 GB)   — repetitive structured data
#   Human Genome (3.0 GB) — DNA reference genome
./download_datasets.sh

# All 14 datasets + enterprise data (~20 GB)
# Includes: IMDb TSV (2.6 GB), Binance BNB trades (612 MB)
./download_datasets.sh --all

# Check what you have
./download_datasets.sh --list
```

All datasets are publicly available from their original sources (kernel.org, mattmahoney.net, NCBI, etc.). The script just automates the download and decompression.
```bash
./apex bench data/silesia.tar
```

Then the full suite, with cooldowns:

```bash
echo "=== Silesia ===" && ./apex bench data/silesia.tar && sleep 10
echo "=== enwik9 ===" && ./apex bench data/enwik9 && sleep 10
echo "=== Linux Kernel ===" && ./apex bench data/realworld/linux-kernel.tar && sleep 10
echo "=== Large JSON ===" && ./apex bench data/realworld/large_json_1gb.json && sleep 10

# Human Genome (3GB) — use compress+decompress to avoid OOM in bench:
echo "=== Human Genome ==="
./apex compress data/realworld/grch38.fna /tmp/g.apex -mt && \
./apex decompress /tmp/g.apex /tmp/g_out && \
cmp data/realworld/grch38.fna /tmp/g_out && echo "PASS" && \
rm -f /tmp/g.apex /tmp/g_out
```

To round-trip a file of your own:

```bash
./apex compress myfile.tar test.apex -mt
./apex decompress test.apex test_out.tar
cmp myfile.tar test_out.tar && echo "ROUND-TRIP: PASS"
md5sum myfile.tar test_out.tar   # Should show identical hashes
rm -f test.apex test_out.tar
```

The included `verify.sh` runs 14 independent checks using only standard Unix tools (`stat`, `md5sum`, `sha256sum`, `cmp`, `date`, `bc`). It does NOT use APEX's self-reported numbers.
What it tests:
- 1T compress works
- Decompress produces correct output
- Lossless: size match + MD5 + SHA256 + byte-level `cmp` (4 checks)
- Parallel mode round-trip (compress + decompress + verify)
- Custom configs: `--par 8`, `--par 20`, `--no-lzp` all round-trip correctly
- Cross-mode: 1T and Par 14MB both decompress
- Determinism: compressing twice → identical output
```bash
# Basic usage
./verify.sh data/silesia.tar

# Specify which binary to test
./verify.sh data/silesia.tar ./apex-gpu-avx512
./verify.sh data/silesia.tar ./apex-cpu-avx2

# Test on multiple datasets
./verify.sh data/enwik9
./verify.sh data/realworld/large_json_1gb.json
./verify.sh data/realworld/linux-kernel.tar
```

Or do it manually without any script:
```bash
# Compress and check the compressed file size yourself
./apex compress data/silesia.tar /tmp/test.apex -mt
ls -la data/silesia.tar /tmp/test.apex
# Calculate ratio: 211957760 / compressed_size

# Decompress and check MD5 yourself
./apex decompress /tmp/test.apex /tmp/test_out
md5sum data/silesia.tar /tmp/test_out
# Both hashes MUST be identical

# Time it yourself
time ./apex compress data/silesia.tar /tmp/test.apex -mt
# speed = 202 MB / real_seconds

rm -f /tmp/test.apex /tmp/test_out
```

`apex bench` measures algorithm speed (data pre-loaded in RAM, excluding file I/O) — the same methodology used by lzbench, Squash, and all standard compression benchmarks. Wall-clock time includes disk read + write, which is slower. Both are valid measurements; they answer different questions.
Laptops throttle GPU/CPU under sustained load. 10 seconds between datasets prevents thermal throttling from affecting subsequent results. Desktops and servers need less cooldown.
APEX auto-detects your physical CPU cores and uses cores - 2 threads (reserves 2 for OS/IO).
```bash
# Check what APEX detected
./apex --help | grep Workers

# Check your actual cores
nproc                               # Total logical threads (includes SMT/HT)
lscpu | grep "Core(s) per socket"   # Physical cores per socket
lscpu | grep "Socket(s)"            # Number of sockets (usually 1)
```

Override:

```bash
./apex compress data.tar out.apex -mt -t 8   # Force 8 threads
./apex tune mydata.tar -t 4                  # Tune with 4 threads
```

| Your CPU | Cores | Auto Workers | Override? |
|---|---|---|---|
| Intel i5 / Ryzen 5 (4-6 core) | 4-6 | 2-4 | Usually fine |
| Intel i7 / Ryzen 7 (8 core) | 8 | 6 | Usually fine |
| Intel i9 / Ryzen 9 (16 core) | 16 | 14 | Usually fine |
| Threadripper (32-64 core) | 32-64 | 28-56 | Consider -t 30 (GPU bottleneck) |
| EPYC / Xeon (64-128 core) | 64-128 | 56+ | Use -t 30 (diminishing returns) |
| Laptop on battery | Any | Auto | Consider -t 4 to save power |
| Shared server | Any | Auto | Use -t <your_fair_share> |
Intel hybrid CPUs (12th-14th gen with P+E cores): APEX uses all detected cores. E-cores are slower for compression. For best results, use -t <P-core count - 2>.
APEX uses ~5.2 GB VRAM by default (128MB transform blocks × 2 GPU contexts × 20.5x working memory).
| Your VRAM | Will It Work? | Notes |
|---|---|---|
| 4 GB | Marginal | May fail on large files. Use --par 6. |
| 6 GB | Yes | Occasional pressure on large blocks. |
| 8 GB | Yes (default) | Designed for this. |
| 12-16 GB | Yes | Extra headroom, no benefit from defaults. |
| 24+ GB | Yes | No additional benefit (blocks capped at 128MB). |
If GPU fails, APEX automatically falls back to CPU transform. Same ratio, slower speed.
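The ~5.2 GB default quoted above follows directly from the stated factors; a quick arithmetic check (taking 1 GB = 1000 MB, which matches the quoted figure):

```python
# Default VRAM footprint: 128 MB transform blocks × 2 GPU contexts
# × 20.5x working memory, per the figures stated above.
block_mb, contexts, working_factor = 128, 2, 20.5
vram_gb = block_mb * contexts * working_factor / 1000
print(f"~{vram_gb:.1f} GB")   # ~5.2 GB
```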
| File Size | RAM Needed (bench) | RAM Needed (compress) |
|---|---|---|
| 100 MB | ~400 MB | ~200 MB |
| 500 MB | ~2 GB | ~1 GB |
| 1 GB | ~4 GB | ~2 GB |
| 3 GB | ~10 GB | ~5 GB |
The bench command needs ~3x file size (input + compressed + decompressed in RAM simultaneously). If OOM, use compress + decompress separately.
Any SSD is fine. APEX peaks at ~1.7 GB/s throughput (Large JSON) — NVMe Gen3+ handles this easily. HDDs may bottleneck on large files.
Block size is the most important tuning parameter. It controls the trade-off between speed and ratio.
| Block Size | Speed | Ratio | Best For |
|---|---|---|---|
| `--par 6` | Fastest | Lowest par ratio | Small/mixed files (<200MB), max throughput |
| `--par 8` | Fast | Good | Text files, Wikipedia, books |
| `--par 12` | Balanced | Better | Medium source code, tarballs |
| `--par 14` | Good | High | Large source code (LLVM, Chromium) |
| `--par 18` | Good | Higher | Repetitive data (JSON, CSV, logs) |
| `--par 20` | Moderate | Best par | Best ratio in parallel mode |
| (no flags) | Slowest | Best overall | 1T mode, archival, maximum ratio |
Why? Larger blocks give transform more context to find patterns. A 20MB block sees patterns spanning 20MB. A 6MB block only sees 6MB. But larger blocks mean fewer blocks, so less pipeline overlap with the GPU.
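The block-count side of that trade-off is simple arithmetic (our illustration, not APEX internals):

```python
# Larger blocks -> more context per block, but fewer blocks available
# to overlap CPU and GPU work in the pipeline.
import math

def block_count(file_mb: float, block_mb: int) -> int:
    return math.ceil(file_mb / block_mb)

for b in (6, 20):
    print(f"--par {b}: {block_count(202, b)} blocks")  # Silesia, 202 MB
```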
Don't guess — measure:

```bash
./apex tune mydata.tar   # Tests ALL sizes, recommends the best
```

To skip preprocessing:

```bash
./apex compress data.tar data.apex -mt --no-lzp
```

The preprocessing pass scans for repeated 40+ byte sequences. It costs roughly 400 MB/s of compression throughput. On data with few long repeats (binary, random-looking), it's pure overhead. `--no-lzp` skips it.
- Text/source/JSON: Keep preprocessing on (ratio gain is worth it)
- Binary/mixed: Try both with
tune, preprocessing might not help - Speed-critical:
--no-lzpfor +65% compress speed
```bash
# Maximum speed: parallel + skip preprocessing + small blocks
./apex compress data.tar data.apex --par 6 --no-lzp

# Maximum ratio: 1T mode (default, no flags needed)
./apex compress data.tar data.apex

# Balanced: parallel with medium blocks
./apex compress data.tar data.apex --par 14

# Controlled: specific threads + blocks
./apex compress data.tar data.apex --par 16 -t 8
```

Our test system: AMD Ryzen 9 8940HX (16C/32T) + NVIDIA RTX 5070 Laptop (8GB) + 16GB DDR5
| Dataset | Size | Compress | Decompress | Ratio | Config |
|---|---|---|---|---|---|
| Silesia (mixed) | 202 MB | 541 MB/s | 672 MB/s | 4.00x | Par 6MB |
| Spark Logs | 2.8 GB | 1,257 MB/s | 1,545 MB/s | 28.35x | Par 16MB |
| Large JSON | 1.1 GB | 1,642 MB/s | 2,022 MB/s | 18.11x | Par 18MB |
| HDFS Logs | 1.5 GB | 994 MB/s | 1,330 MB/s | 16.36x | Par 12MB |
| BGL Logs | 709 MB | 767 MB/s | 1,102 MB/s | 17.32x | Par 12MB |
| Linux Kernel | 1.5 GB | 817 MB/s | 999 MB/s | 9.26x | Par 12MB |
| Binance BNB | 612 MB | 531 MB/s | 682 MB/s | 7.27x | Par 6MB |
| IMDb TSV | 2.6 GB | 583 MB/s | 719 MB/s | 5.36x | Par 6MB |
| enwik9 (text) | 954 MB | 634 MB/s | 697 MB/s | 4.36x | Par 8MB |
| Human Genome | 3.0 GB | 479 MB/s | 757 MB/s | 4.36x | Par 8MB |
| Dataset | Ratio | Compress | Decompress |
|---|---|---|---|
| Spark Logs | 29.16x | 417 MB/s | 1,780 MB/s |
| Large JSON | 23.11x | 540 MB/s | 1,965 MB/s |
| HDFS Logs | 17.79x | 376 MB/s | 1,357 MB/s |
| BGL Logs | 17.03x | 324 MB/s | 1,033 MB/s |
| Linux Kernel | 9.64x | 329 MB/s | 1,201 MB/s |
| Binance BNB | 7.10x | 247 MB/s | 654 MB/s |
| IMDb TSV | 5.53x | 249 MB/s | 860 MB/s |
| enwik9 | 5.04x | 241 MB/s | 642 MB/s |
| Human Genome | 4.48x | 213 MB/s | 828 MB/s |
| Silesia | 4.02x | 226 MB/s | 578 MB/s |
Your numbers will differ based on your GPU, CPU, and RAM. Run ./apex bench and ./apex tune to measure YOUR system.
| Problem | Cause | Fix |
|---|---|---|
| `GPU: Disabled (CPU-only mode)` | CUDA not found | Install CUDA toolkit, ensure `nvcc` is in PATH |
| First run is slow | CUDA driver loading (one-time per process) | Normal. Use `apex bench` for accurate speed. CPU-only binary has no delay. |
| `Killed` or no output after starting | Not enough RAM (see below) | Use compress+decompress separately |
| GPU memory errors | VRAM < 8 GB | Use `--par 6` for smaller GPU transform blocks |
| `GLIBC_2.38 not found` | Old Linux | Need Ubuntu 24.04+ or Fedora 39+, or glibc 2.38+ |
| Speeds lower than reference | Different hardware | Expected. Run `./apex tune` for YOUR optimal config. |
| Speed drops during long benchmarks | Thermal throttling | Add `sleep 10` between datasets. Desktop/server won't have this. |
| `command not found` | Not executable | `chmod +x apex` |
| `No such file or directory` for datasets | Not downloaded | Run `./download_datasets.sh` first |
This binary reads the entire input file into RAM before processing. The compression algorithm itself is block-based (6-20 MB blocks) and needs only a few GB of working memory, but the CLI loads the full file upfront. Here's what that means in practice:
| Command | RAM needed | Why |
|---|---|---|
| `bench` | ~3x file size | Holds original + compressed + decompressed simultaneously |
| `compress` | ~1.5x file size | Holds original + compressed output |
| `decompress` | ~1.5x file size | Holds compressed + decompressed output |

| File Size | `bench` needs | `compress`/`decompress` needs |
|---|---|---|
| 200 MB | ~600 MB | ~300 MB |
| 1 GB | ~3 GB | ~1.5 GB |
| 3 GB | ~9 GB | ~4.5 GB |
| 5 GB+ | ~15 GB+ | ~7.5 GB+ |
For the 5 essential datasets (up to 3 GB), 16 GB RAM is sufficient for all commands. For enterprise datasets over 4 GB, use `compress`/`decompress` separately instead of `bench`, and use `--par 6` (smallest blocks = lowest memory overhead).
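The rule of thumb above is easy to script before kicking off a long run. A sketch, assuming Linux with GNU `stat` — the `apex_ram_estimate` helper is illustrative, not an APEX command:

```shell
# Hypothetical helper: estimate RAM needed for each APEX command on a file,
# using the multipliers from the table above (~3x bench, ~1.5x compress/decompress).
apex_ram_estimate() {
  size_mb=$(( ($(stat -c%s "$1") + 1048575) / 1048576 ))   # size, rounded up to MB
  echo "file: ${size_mb} MB"
  echo "bench needs ~$(( size_mb * 3 )) MB RAM"
  echo "compress/decompress need ~$(( size_mb * 3 / 2 )) MB RAM"
}
```

Compare the estimate against `free -h` before choosing between `bench` and a separate `compress`/`decompress` run.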
How to tell if it OOM'd:

```shell
# Signs of OOM:
# - "Killed" message
# - Process exits with no output
# - dmesg shows "Out of memory: Killed process"
dmesg | tail -5
```

Fix — use `compress` + `decompress` separately:

```shell
./apex compress data/realworld/grch38.fna /tmp/test.apex --par 6
./apex decompress /tmp/test.apex /tmp/test_out
cmp data/realworld/grch38.fna /tmp/test_out && echo "PASS"
rm -f /tmp/test.apex /tmp/test_out

# To measure speed externally:
time ./apex compress data/realworld/grch38.fna /tmp/test.apex --par 6
# speed ≈ file_size_MB / real_seconds
```

Full system check:

```shell
./apex --help    # APEX version, GPU status, workers
nvidia-smi       # GPU model, VRAM, driver version
nvcc --version   # CUDA toolkit version
nproc            # CPU thread count
free -h          # Available RAM
lscpu | grep -E "Model name|Core|Socket|Thread"   # CPU details
```

When sharing benchmark results, please include this information so others can compare:
```
=== Hardware ===
CPU: [model] ([cores]C/[threads]T)
GPU: [model] ([VRAM] GB)
RAM: [total] GB [DDR4/DDR5] [single/dual channel]
OS: [distro] [version], Kernel [version]
CUDA: [version]

=== APEX Info ===
[paste output of: ./apex --help | head -4]

=== Results ===
[paste output of: ./apex bench data/silesia.tar]

=== Round-trip Verification ===
[paste output of: ./apex compress data/silesia.tar /tmp/t.apex -mt && \
                  ./apex decompress /tmp/t.apex /tmp/t_out && \
                  md5sum data/silesia.tar /tmp/t_out]
```
Example:

```
=== Hardware ===
CPU: AMD Ryzen 7 7700X (8C/16T)
GPU: NVIDIA RTX 4070 (12 GB)
RAM: 32 GB DDR5-6000 dual channel
OS: Ubuntu 24.04, Kernel 6.8.0
CUDA: 13.2.51

=== APEX Info ===
APEX 0.1.0 — GPU-Accelerated Lossless Compression
SIMD: AVX-512 (Tier 1)
GPU: CUDA enabled
Workers: 6 threads

=== Results ===
Config     Compress    Decomp      Ratio   Verify
1T         XXX MB/s    XXX MB/s    4.02x   PASS
Par 6MB    XXX MB/s    XXX MB/s    4.00x   PASS
...
```
APEX uses libcubwt and libsais — both by Ilya Grebnov. These are exceptional libraries that make high-performance BWT practical.
We also benchmark against libbsc (also by Grebnov) — a BWT compressor using the same underlying libraries. bsc achieves 5-14% better ratio; APEX is 7-18x faster. Full comparison in BENCHMARKS.md.
- APEX is under active development (v0.1.0). While all 21 benchmark datasets pass round-trip verification, there may be edge cases or configurations we haven't encountered yet. If you find an issue, we'd appreciate hearing about it.
- There is one known issue: LLVM 2.4GB fails in 1T mode due to a block boundary bug at ~384MB. All parallel modes work correctly on this file.
- Performance varies by hardware. Our reference numbers are from a specific test system (Ryzen 9 8940HX + RTX 5070 Laptop). Your results will differ based on GPU, CPU, and thermal conditions.
- The binary auto-detects hardware capabilities. If a GPU is not available or CUDA is not installed, it falls back to CPU-only mode automatically and prints `GPU: Disabled (CPU-only mode)`.
- Compression is lossless — decompressed output is byte-identical to the original. Every `bench` and `tune` run verifies this automatically (PASS/FAIL).
For questions, collaboration, or licensing inquiries: ritik135001@gmail.com
APEX Testing License v1.0 — Copyright 2026 Ritik. All rights reserved.
This binary is provided exclusively for testing, benchmarking, and evaluation. No ownership or IP rights are transferred. See LICENSE for full terms.
You MAY: download, run, benchmark, share results with attribution, distribute this package unchanged.
You MAY NOT: reverse-engineer, decompile, use in production or commercial services, resell, wrap in another product/service/API, repackage under a different name, claim credit, or create derivative works.
All rights not explicitly granted are reserved by the author.