GPU-Accelerated Lossless Compression — High Ratio at High Speed
Pre-built binaries, benchmark datasets, and verification tools for independent validation and demonstration. Source code is not included — APEX is in active development.
APEX achieves high compression ratios at high throughput — a combination that has traditionally required choosing one or the other. GPU-accelerated for maximum performance, with a dedicated CPU-only binary for systems without NVIDIA GPUs.
| Data Type | Ratio | Compress | Decompress | Config |
|---|---|---|---|---|
| Mixed corpus (Silesia 202MB) | 4.00x | 541 MB/s | 672 MB/s | Par 6MB |
| Server logs (Spark 2.8GB) | 28.35x | 1,257 MB/s | 1,545 MB/s | Par 16MB |
| Structured data (JSON 1.1GB) | 18.11x | 1,642 MB/s | 2,022 MB/s | Par 18MB |
| HPC logs (BGL 709MB) | 17.32x | 767 MB/s | 1,102 MB/s | Par 12MB |
| Source code (Linux Kernel 1.5GB) | 9.26x | 817 MB/s | 999 MB/s | Par 12MB |
| Financial tick data (Binance 612MB) | 7.27x | 531 MB/s | 682 MB/s | Par 6MB |
| Analytics export (IMDb 2.6GB TSV) | 5.36x | 583 MB/s | 719 MB/s | Par 6MB |
| Genomic data (Human Genome 3GB) | 4.36x | 479 MB/s | 757 MB/s | Par 8MB |
Numbers above are from a consumer laptop (RTX 5070 Laptop, 8 GB GDDR7, 16 GB RAM) — not a server or workstation. No per-dataset tuning — out-of-the-box performance. RTX 5090 results: up to 1,899 MB/s compress, 4,403 MB/s decompress. Server-class hardware would be expected to improve further. See BENCHMARKS.md for all 21 datasets across 3 systems.
Tested on 3 systems with different GPUs (RTX 5070, 4090, 5090) and CPUs (Zen 2, Zen 4). Ratios are deterministic — identical across all hardware. Speeds scale with GPU compute and CPU core count.
No GPU? APEX still works. CPU-only mode: 826 MB/s on JSON, 253 MB/s on Linux Kernel, 131 MB/s at 4.0x on Silesia — faster, even without a GPU, than every BWT compressor we tested. See CPU-Only Mode.
RAM note: This testing binary reads the full file into memory before processing. bench needs ~3x file size in RAM, compress/decompress need ~1.5x. The compression algorithm itself is block-based and does not require the full file in memory — this is specific to the current testing CLI, not an algorithm constraint. See memory details.
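The RAM rule above is simple multiplication; a minimal sketch (the `ram_needed_gb` helper is ours, for illustration — not part of the APEX CLI):

```python
# Rough RAM estimate for the current testing CLI, per the rule above:
# bench holds ~3x the file size in RAM, compress/decompress ~1.5x.
def ram_needed_gb(file_size_gb: float, command: str) -> float:
    factor = {"bench": 3.0, "compress": 1.5, "decompress": 1.5}[command]
    return file_size_gb * factor

# A 3 GB genome needs ~9 GB for bench, ~4.5 GB for compress:
print(ram_needed_gb(3.0, "bench"))      # 9.0
print(ram_needed_gb(3.0, "compress"))   # 4.5
```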
vs zstd on server logs: On enterprise log data (Spark, HDFS, BGL), APEX achieves 21-34% better ratio than zstd at any level while matching or exceeding zstd's compress speed. Tested across zstd levels 9/12/15 with 6-14 threads. No zstd configuration reaches APEX's ratio — BWT captures log template repetition that LZ77 cannot. Full comparison in BENCHMARKS.md.
Full results, 3-system comparison, CPU-only benchmarks, and vs-competition in BENCHMARKS.md.
APEX is in active development. This binary is shared for community validation — verify the claims on your own hardware.
Don't want to run an unknown binary on your machine? Completely understandable — running closed-source binaries from the internet requires trust, and we haven't earned that yet. The included `verify.sh` validates everything using standard Unix tools (`md5sum`, `stat`, `cmp`) without trusting APEX's own output, but that still means running the binary. If you're genuinely interested in testing but don't want to run it on your hardware, I'll provision a cloud GPU instance for you on Vast.ai (or any provider you prefer) at my expense — you launch it, test freely, and tear it down when done. I can also add you to my Vast.ai team so you can provision any instance yourself. From clone to full benchmark results takes ~10 minutes: Quick setup guide. Reach out at ritik135001@gmail.com.
| System | GPU | CPU | Best Compress | Best Decompress |
|---|---|---|---|---|
| Dev machine (laptop) | RTX 5070 Laptop | Ryzen 9 8940HX (Zen 4) | 1,642 MB/s | 2,022 MB/s |
| Vast.ai cloud | RTX 4090 | EPYC 7D12 (Zen 2) | 1,793 MB/s | 2,324 MB/s |
| Vast.ai cloud | RTX 5090 | Dual EPYC 7742 (Zen 2) | 1,899 MB/s | 4,403 MB/s |
Ratios match exactly across all 3 systems. All round-trip verified PASS. Full results in BENCHMARKS.md.
Note: Speed depends on both GPU and CPU. See CPU Architecture Effects for details on how Zen 2 vs Zen 4, AVX2 vs AVX-512, and clock speed affect results.
This repo serves two purposes — showcasing what APEX can do and letting you verify it independently.
We've published results showing high ratio at high speed — a combination no existing compressor achieves. Rather than asking you to take our word for it, this binary lets you:
- Validate — Run the same benchmarks on your hardware and verify the claims
- Reproduce — Download the exact datasets we used and reproduce our methodology
- Compare — Test against zstd, bzip2, libbsc, or any compressor on the same data
- Verify correctness — Every compressed file round-trips to a byte-identical original (PASS/FAIL)
- Explore — Test on your own data, find optimal configs for your workload
If you find APEX useful or interesting, reach out at ritik135001@gmail.com.
| File | Size | GPU | CPU | Description |
|---|---|---|---|---|
| `apex` | 16 MB | Yes | AVX2 | Default — GPU + any modern CPU |
| `apex-gpu-avx2` | 16 MB | Yes | AVX2 | Same as `apex` |
| `apex-gpu-avx512` | 16 MB | Yes | AVX-512 | GPU + AVX-512 CPU (Zen 4+, Intel 12th gen+). Faster decompress. |
| `apex-cpu-avx2` | 1.3 MB | No | AVX2 | CPU-only — no CUDA needed. Any CPU from 2013+. |
| `apex-cpu-avx512` | 1.3 MB | No | AVX-512 | CPU-only — no CUDA needed. Zen 4+, Intel 12th gen+. Fastest CPU-only. |
| `apex-cpu-sse42` | 1.1 MB | No | SSE4.2 | CPU-only — oldest CPUs. Sandy Bridge+ (2011+). No AVX needed. |
| `download_datasets.sh` | 15 KB | — | — | Downloads benchmark datasets into `data/` |
| `verify.sh` | 7 KB | — | — | Independent verification using standard Unix tools |
| `sysinfo.sh` | 3 KB | — | — | Prints full system info (CPU, GPU, RAM, CUDA, OS) |
| `BENCHMARKS.md` | — | — | — | Full results: 21 datasets, 3 systems, CPU-only, vs-competition |
| `LICENSE` | — | — | — | Testing license |
For best performance, use the binary that matches your system. The `*-avx512` variants are significantly faster on decompress (+77-122%) and slightly faster on compress (~10-15%). Using the wrong variant won't give wrong results — `apex` (default) always works — but you'll leave performance on the table.
Step 1: Check your system:
```bash
# Do you have an NVIDIA GPU?
nvidia-smi
# Shows GPU info → you have a GPU
# "command not found" → no GPU, use apex-cpu-*

# Does your CPU support AVX-512?
grep -c avx512 /proc/cpuinfo
# Number > 0 → YES, use *-avx512 variant for best speed
# 0 → NO, use *-avx2 variant (or default apex)

# Quick one-liner to tell you which binary to use:
if nvidia-smi &>/dev/null; then
  if grep -q avx512 /proc/cpuinfo; then echo "Use: apex-gpu-avx512";
  else echo "Use: apex (default)"; fi
else
  if grep -q avx512 /proc/cpuinfo; then echo "Use: apex-cpu-avx512";
  elif grep -q avx2 /proc/cpuinfo; then echo "Use: apex-cpu-avx2";
  else echo "Use: apex-cpu-sse42"; fi
fi
```

Step 2: Pick your binary:

```
Do you have an NVIDIA GPU + CUDA?
├─ YES → Does your CPU support AVX-512?
│   ├─ YES (Zen 4+, Intel 12th+) → apex-gpu-avx512 (fastest)
│   └─ NO (older CPU)            → apex (default, always works)
└─ NO → Does your CPU support AVX2?
    ├─ YES + AVX-512 → apex-cpu-avx512 (fastest CPU-only)
    ├─ YES           → apex-cpu-avx2 (any CPU from 2013+)
    └─ NO (very old) → apex-cpu-sse42 (Sandy Bridge+ 2011+)
```
| Your System | Best Binary | What it needs |
|---|---|---|
| NVIDIA GPU + AVX-512 CPU | `apex-gpu-avx512` | CUDA + AVX-512 |
| NVIDIA GPU + older CPU | `apex` (default) | CUDA + AVX2 |
| No GPU + AVX-512 CPU | `apex-cpu-avx512` | Just AVX-512 |
| No GPU + AVX2 CPU (2013+) | `apex-cpu-avx2` | Just AVX2 |
| No GPU + old CPU (2011+) | `apex-cpu-sse42` | Just SSE4.2 |
Note on `apex-cpu-sse42`: This binary is verified to contain zero AVX/AVX2/AVX-512 instructions (confirmed via `objdump`). Round-trip tested and cross-compatible with all other binaries. However, it has not been tested on actual pre-AVX2 hardware (Sandy Bridge/Ivy Bridge). If you have such hardware and test it, please share your results.
Note: Using `apex` (default) on an AVX-512 CPU works fine — correct results, good speed. You just won't get the extra decompress boost that `apex-gpu-avx512` provides. It never crashes; it just doesn't use the wider instructions.
No NVIDIA GPU? No CUDA? No problem. Use `apex-cpu-avx2` (or `apex-cpu-avx512` for Zen 4+):

```bash
chmod +x apex-cpu-avx2
./apex-cpu-avx2 --help    # Shows "GPU: Disabled (CPU-only mode)"
./apex-cpu-avx2 bench data/silesia.tar
./apex-cpu-avx2 compress myfile.tar myfile.apex -mt
./apex-cpu-avx2 decompress myfile.apex restored.tar
```

All binaries produce identical compressed files — same format, same ratios. A file compressed with `apex-cpu-avx2` can be decompressed with `apex-gpu-avx512` and vice versa.
CPU-only APEX is faster than every BWT compressor we tested (bsc, bzip2, bzip3) — and LZMA as well — even without a GPU. See CPU-Only benchmarks.
Source code is not part of this release. APEX is in development and will be released when ready. This binary is provided specifically for the purpose of community validation and independent benchmarking.
New here? Start with Quick Start, then run `./apex tune mydata.tar` — it tests all configurations on YOUR data and recommends the best one.
| # | Section | What You'll Find |
|---|---|---|
| 1 | System Requirements | What hardware and software you need |
| 2 | Setup Checklist | Step-by-step verification before running |
| 3 | Quick Start | From download to first benchmark in 30 seconds |
| 4 | Understanding the Output | What every line of output means |
| 5 | All Commands | Every command and flag with examples |
| 6 | Default Behavior | What happens when you run each mode |
| 7 | Download Datasets | Get the exact datasets we benchmarked |
| 8 | Reproduce Our Benchmarks | Step-by-step to reproduce our published numbers |
| 9 | Hardware Configuration | Adapt APEX to your CPU, GPU, and RAM |
| 10 | Advanced Tuning | Block size selection, preprocessing, flag combinations |
| 11 | Reference Numbers | Our published results (your target to match/beat) |
| 12 | Troubleshooting | Common issues and fixes |
| 13 | Reporting Results | How to share your findings with the community |
All published benchmark numbers were measured on this exact configuration:
| Component | Specification |
|---|---|
| Machine | ASUS TUF Gaming A16 (2025) — laptop |
| CPU | AMD Ryzen 9 8940HX (Zen 4, 16 cores / 32 threads, 1MB L2/core, 64MB L3, 5.4 GHz boost) |
| GPU | NVIDIA RTX 5070 Laptop (Blackwell GB206, 8GB GDDR7, 384 GB/s bandwidth) |
| RAM | 16GB DDR5-5200 single-channel (~40 GB/s) |
| Storage | Samsung 1TB NVMe Gen4 (~6.5 GB/s sequential) |
| OS | Ubuntu 24.04.4 LTS, Kernel 6.17.0 |
| CUDA | 13.2.51, Driver 580.126.09 |
| Power | Plugged in (AC), ASUS Performance mode (max fans), CPU governor: performance |
All benchmarks were run plugged in with Performance thermal profile enabled. Battery mode reduces GPU power from ~115W to ~47W, significantly lowering speeds. If you're on a laptop, make sure you're plugged in and in performance mode for comparable results.
Your results will differ based on your hardware. That's the point — we want to see how APEX performs across different systems.
APEX is benchmarked on 21 real-world datasets across multiple domains — standard benchmarks, enterprise production data, and real-world downloads.
| Category | Datasets | Why it matters |
|---|---|---|
| Standard benchmarks | Silesia (202MB), enwik8 (96MB), enwik9 (954MB) | Industry-standard. Every compressor publishes these. Directly comparable. |
| Source code | Linux Kernel (1.5GB), LLVM (2.4GB) | Real production codebases. Tests scaling on large repetitive data. |
| Server logs | Spark (2.8GB), HDFS (1.5GB), BGL (709MB) | Enterprise log pipelines — the data Datadog/Splunk/Elastic ingest daily. |
| Financial data | Binance BTC (3.7GB), Binance BNB (612MB) | Real exchange tick data. CSV with prices, quantities, timestamps. |
| Analytics/data lake | IMDb TSV (2.6GB), GH Events JSON (480MB), Large JSON (1.1GB), Wiki SQL, CSV | Database exports, API logs, tabular data. |
| Genomics | Human Genome GRCh38 (3.0GB) | Real DNA reference genome. BWT is native to this domain. |
| Incompressible | Firefox (79MB), Taxi Parquet (48MB+659MB) | Already-compressed data. APEX detects and stores RAW at memcpy speed. |
All datasets are publicly downloadable. No synthetic or generated data. The included download_datasets.sh fetches the same files we used — you test on exactly what we tested on.
| Requirement | Minimum | How to Check |
|---|---|---|
| OS | Linux x86-64 (Ubuntu 22.04+, Fedora 38+, Arch) | uname -m should show x86_64 |
| CPU | Any x86-64 with AVX2 (Haswell 2013 onwards) | grep avx2 /proc/cpuinfo |
| RAM | 4 GB (for small files) | free -h |
| Disk | 1 GB free (for binary + one dataset) | df -h . |
| Requirement | Minimum | How to Check |
|---|---|---|
| NVIDIA GPU | Turing or newer (RTX 20xx+, T4+) | nvidia-smi |
| NVIDIA Driver | 525+ | nvidia-smi (top row shows driver version) |
| CUDA Toolkit | 12.0+ | nvcc --version |
| VRAM | 6 GB+ (8 GB recommended) | nvidia-smi (shows memory) |
| Generation | GPUs | Year |
|---|---|---|
| Turing | RTX 2060-2080 Ti, GTX 1660 Ti, T4 | 2018-2019 |
| Ampere | RTX 3060-3090, A100, A10, A30 | 2020-2021 |
| Lovelace | RTX 4060-4090, L40, L40S | 2022-2023 |
| Hopper | H100, H200 | 2023-2024 |
| Blackwell | RTX 5070-5090, B100, B200 | 2024-2025 |
No GPU? APEX still works — it falls back to CPU-only mode automatically. Same compression ratio, just slower speed.
Quickest way: Run the included sysinfo.sh script — it checks everything at once:
```bash
chmod +x sysinfo.sh
./sysinfo.sh
```

This prints your CPU (model, cores, clock, AVX2/AVX-512), GPU (model, VRAM, driver), RAM, CUDA version, OS, and APEX status. Copy-paste the output when sharing benchmark results.
Or check manually, one by one:
```bash
# 1. Check OS
uname -m
# Expected: x86_64

# 2. Check CPU supports AVX2
grep -c avx2 /proc/cpuinfo
# Expected: a number > 0

# 3. Check NVIDIA driver
nvidia-smi
# Expected: shows GPU name, driver version, VRAM
# If "command not found": no NVIDIA driver installed (APEX will use CPU-only mode)

# 4. Check CUDA toolkit
nvcc --version
# Expected: shows CUDA version 12.0+
# If "command not found": install CUDA toolkit (see below)

# 5. Check available RAM
free -h
# Look at "available" column — need at least 2-4 GB free

# 6. Check disk space
df -h .
# Need ~10 GB for datasets + compressed files
```

To install the CUDA toolkit:

```bash
# Ubuntu 22.04/24.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt install -y cuda-toolkit-13-2
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # Should now work

# Fedora:
sudo dnf install cuda-toolkit

# If you don't want to install CUDA:
# Use apex-cpu-avx2 instead — no CUDA needed, full CPU performance
```

With a GPU:

```bash
# Clone the repo
git clone https://github.com/Rkcr7/apex-testing.git
cd apex-testing

# Make binaries executable
chmod +x apex apex-gpu-avx2 apex-gpu-avx512 apex-cpu-avx2 apex-cpu-avx512 apex-cpu-sse42 verify.sh download_datasets.sh sysinfo.sh

# Print your system info (share this with benchmarks)
./sysinfo.sh

# Check it works
./apex --help
# You should see: APEX version, SIMD tier, GPU status, worker count
# If you have AVX-512: use ./apex-gpu-avx512 instead for best performance

# Download benchmark datasets
./download_datasets.sh

# Benchmark Silesia
./apex bench data/silesia.tar

# Benchmark ALL 5 datasets at once (with cooldown, saves to results.txt)
for f in data/silesia.tar data/enwik9 data/realworld/linux-kernel.tar data/realworld/large_json_1gb.json data/realworld/grch38.fna; do echo ""; echo "=== $f ==="; ./apex bench "$f"; sleep 10; done 2>&1 | tee results.txt

# Compress a file
./apex compress myfile.tar myfile.apex -mt

# Decompress
./apex decompress myfile.apex restored.tar

# Verify it's lossless (byte-perfect)
cmp myfile.tar restored.tar && echo "PASS: Files are identical"
```

CPU-only:

```bash
git clone https://github.com/Rkcr7/apex-testing.git
cd apex-testing
chmod +x apex-cpu-avx2 apex-cpu-avx512 apex-cpu-sse42 verify.sh download_datasets.sh sysinfo.sh
./sysinfo.sh                              # Check system
./download_datasets.sh                    # Download datasets
./apex-cpu-avx2 bench data/silesia.tar    # Benchmark (CPU-only)
./apex-cpu-avx2 compress myfile.tar myfile.apex -mt
./apex-cpu-avx2 decompress myfile.apex restored.tar
# If your CPU has AVX-512: use apex-cpu-avx512 instead for best speed
```

On startup, `./apex --help` reports something like:

```
APEX 0.1.0 — GPU-Accelerated Lossless Compression
SIMD: AVX-512 (Tier 1)     ← Your CPU's vector instruction set
GPU: CUDA enabled          ← GPU detected and ready
Workers: 14 threads        ← Auto-detected worker count (your cores - 2)
```
- SIMD Tier 1 (AVX-512): Best. Found on AMD Zen 4+ and Intel Ice Lake+.
- SIMD Tier 2 (AVX2): Good. Any modern CPU.
- GPU: CUDA enabled: GPU acceleration is active (~10x faster than CPU-only).
- GPU: Disabled (CPU-only mode): No CUDA found. Works fine, just slower.
- Workers: Number of parallel threads. Auto-detected as `physical_cores - 2`.
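The auto-detection rule above can be sketched in a few lines (the floor of 1 worker is our assumption for very small CPUs, not confirmed by the docs):

```python
# Worker auto-detection as described above: physical cores minus 2,
# reserving 2 cores for OS/IO. Floor at 1 is our assumption.
def auto_workers(physical_cores: int) -> int:
    return max(1, physical_cores - 2)

print(auto_workers(16))  # 14 — matches "Workers: 14 threads" on a 16-core CPU
```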
```
Compressed: 211957760 -> 52983644 bytes (4.00x ratio)
Speed: 541 MB/s   Time: 373 ms   Threads: 14
```
- 211957760 -> 52983644: Original size → compressed size (in bytes)
- 4.00x ratio: Original / compressed = how much smaller. Higher = better.
- 541 MB/s: Compression throughput (original_size / time). Higher = faster.
- Threads: 14: Worker threads used.
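The figures on that line can be recomputed by hand; a quick check (using 1 MB = 2^20 bytes, which is what the published speeds appear to use):

```python
# Recomputing "211957760 -> 52983644 bytes (4.00x ratio), 541 MB/s, 373 ms":
original, compressed = 211_957_760, 52_983_644
time_ms = 373

ratio = original / compressed                     # original / compressed
speed_mb_s = original / 2**20 / (time_ms / 1000)  # original_size / time

print(f"{ratio:.2f}x")           # 4.00x
print(f"{speed_mb_s:.0f} MB/s")  # ~542 MB/s — within rounding of the published 541
```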
```
Config    Compress   Decomp     Ratio   Verify
------    --------   ------     -----   ------
1T        226 MB/s   613 MB/s   4.02x   PASS
Par 6MB   541 MB/s   672 MB/s   4.00x   PASS
```
- 1T: Single-thread mode (best ratio, slower). Uses 1 GPU transform + parallel encoding.
- Par 6MB: Parallel mode with 6MB blocks. Uses 14 workers + GPU. Fastest.
- Compress/Decomp: Speed in MB/s. These are algorithm speed (excludes file I/O).
- Ratio: Compression ratio. Higher = better.
- PASS: Round-trip verified (compress → decompress → byte-compare = identical).
```
./apex compress <input> <output.apex> [flags]
```

| Flag | What It Does | When to Use |
|---|---|---|
| (no flags) | 1T mode: single-thread + GPU, largest blocks | Best ratio. Archival. |
| `-mt` | Parallel mode: auto threads + GPU, auto block size | Best speed. Use this most of the time. |
| `--par N` | Parallel with N MB blocks (6, 8, 12, 14, 16, 18, 20) | Specific tuning after running `tune`. |
| `-t N` | Use exactly N worker threads | Control CPU usage (e.g., `-t 4` on a shared server). |
| `--no-lzp` | Skip preprocessing | +65% compress speed, -0.5% ratio. Speed-critical. |
| `-v` | Verbose output | See GPU status, pipeline details. |

Examples:

```bash
./apex compress data.tar data.apex              # 1T mode (best ratio)
./apex compress data.tar data.apex -mt          # Parallel (best speed)
./apex compress data.tar data.apex --par 14     # 14MB blocks (for source code)
./apex compress data.tar data.apex -mt -t 8     # Parallel, 8 threads
./apex compress data.tar data.apex -mt --no-lzp # Parallel, skip preprocessing (fastest)
./apex compress data.tar data.apex -v           # Verbose (see pipeline info)
```

```
./apex decompress <input.apex> <output> [-v]
```

Decompression auto-detects everything from the `.apex` file header. No flags needed.

```bash
./apex decompress data.apex restored.tar        # Decompress
./apex decompress data.apex restored.tar -v     # Verbose
```

```
./apex tune <input> [-t N]
```

Tests all configs (1T + 7 parallel block sizes), measures speed and ratio, and recommends the best. Run this before benchmarking.

```bash
./apex tune mydata.tar          # Auto threads
./apex tune mydata.tar -t 8     # Test with 8 threads
```

Output: a table of all configs with speeds and ratio, then specific recommendations for fastest compress, best ratio, fastest decompress, and best overall.

```
./apex bench <input> [-v]
```

Tests 8 configurations automatically: 1T + Par 6/8/12/14/16/18/20 MB. For each config:

- Warmup run (initializes GPU, excluded from timing)
- Compress best-of-2 (data pre-loaded in RAM, measures algorithm speed only)
- Decompress best-of-2
- Round-trip verify (`memcmp` original vs decompressed)

Speed measurement: data is pre-loaded in RAM before timing starts. The timer wraps only the compression/decompression call — no file I/O, no memory allocation. This is the same methodology used by lzbench and other standard benchmark frameworks. The speed you see is pure algorithm throughput.

```bash
./apex bench data/silesia.tar       # Standard benchmark
./apex bench data/silesia.tar -v    # With methodology notes
```

Example output:

```
Config     Compress   Decomp     Ratio   Verify
------     --------   ------     -----   ------
1T         226 MB/s   613 MB/s   4.02x   PASS
Par 6MB    541 MB/s   672 MB/s   4.00x   PASS
Par 8MB    524 MB/s   654 MB/s   4.01x   PASS
Par 12MB   413 MB/s   621 MB/s   4.04x   PASS
...
```

To see wall-clock speed (including file I/O), use `time ./apex compress ...` or the `verify.sh` script, which shows both.

```
./apex info <output.apex>
```

Shows block count, compression ratio, format version, original size.
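The warmup + best-of-2, in-RAM timing used by `bench` can be sketched generically. Here `zlib` stands in for APEX — this illustrates the methodology, not APEX's actual code:

```python
# lzbench-style timing: data pre-loaded in RAM, one warmup run excluded,
# then best-of-2 with the timer wrapping only the codec call.
import time
import zlib

def best_of(n, fn, *args):
    best = float("inf")
    for _ in range(n):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

data = b"the quick brown fox " * 100_000   # pre-loaded in RAM, no file I/O
zlib.compress(data)                        # warmup, excluded from timing

t = best_of(2, zlib.compress, data)
print(f"{len(data) / 2**20 / t:.0f} MB/s") # pure algorithm throughput
```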
- APEX detects content type (text? binary? JSON? already compressed?)
- Runs in 1T mode: single-thread compression using GPU acceleration
- Creates 1-2 large blocks for maximum context
- Applies preprocessing (removes long-range repeated sequences)
- GPU-accelerated transform (groups similar contexts together)
- Entropy encoding (near-optimal bit-level coding)
- Output is a single `.apex` file with headers, compressed blocks, and checksums
This gives the BEST ratio but is slower than parallel mode.
- Same pipeline, but splits input into N blocks of 6MB each
- 14 worker threads (auto-detected) process blocks in parallel
- Workers share the GPU — while 1 uses GPU, others do CPU encoding in parallel
- A collector thread writes blocks in order
- This gives the BEST speed — typically 2-5x faster than 1T.
`--par 14` behaves the same as `-mt` but with 14MB blocks instead of the auto-selected 6MB. Larger blocks = better ratio, slightly fewer blocks for pipeline overlap.
`--no-lzp` skips the preprocessing step. The preprocessor scans for repeated 40+ byte sequences and removes them before the transform. Skipping it makes compression ~65% faster but loses ~0.5% ratio. Worth it if speed matters more than that last 0.5%.
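To make "repeated 40+ byte sequences" concrete, here is a toy sketch of what a long-match preprocessor looks for. This is NOT APEX's actual preprocessor — just the idea of flagging fixed-size windows that recur:

```python
# Toy long-repeat detector: hash 40-byte windows, flag ones seen before.
# A real preprocessor would also encode the repeats away; we only count them.
MIN_MATCH = 40

def count_long_repeats(data: bytes, step: int = MIN_MATCH) -> int:
    seen, repeats = set(), 0
    for i in range(0, len(data) - MIN_MATCH + 1, step):
        window = data[i:i + MIN_MATCH]
        if window in seen:
            repeats += 1          # 40-byte window seen before: removable repeat
        else:
            seen.add(window)
    return repeats

log_like = b"INFO executor 12 finished task in 305 ms on host node-7\n" * 1000
print(count_long_repeats(log_like))   # template-heavy logs -> lots of repeats
```

On repetitive data (logs, JSON) nearly every window recurs, which is why the pass pays off there; on random-looking binary almost none do, and it is pure overhead.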
The included script downloads the same public datasets used in our benchmarks.
```bash
chmod +x download_datasets.sh

# Essential 5 datasets (~8.4 GB on disk):
#   Silesia (202 MB)      — universal mixed benchmark
#   enwik9 (954 MB)       — Wikipedia text
#   Linux Kernel (1.5 GB) — source code tarball
#   Large JSON (1.1 GB)   — repetitive structured data
#   Human Genome (3.0 GB) — DNA reference genome
./download_datasets.sh

# All 14 datasets + enterprise data (~20 GB)
# Includes: IMDb TSV (2.6 GB), Binance BNB trades (612 MB)
./download_datasets.sh --all

# Check what you have
./download_datasets.sh --list
```

All datasets are publicly available from their original sources (kernel.org, mattmahoney.net, NCBI, etc.). The script just automates the download and decompression.
```bash
./apex bench data/silesia.tar
```

Then the full suite, with cooldowns:

```bash
echo "=== Silesia ===" && ./apex bench data/silesia.tar && sleep 10
echo "=== enwik9 ===" && ./apex bench data/enwik9 && sleep 10
echo "=== Linux Kernel ===" && ./apex bench data/realworld/linux-kernel.tar && sleep 10
echo "=== Large JSON ===" && ./apex bench data/realworld/large_json_1gb.json && sleep 10

# Human Genome (3GB) — use compress+decompress to avoid OOM in bench:
echo "=== Human Genome ==="
./apex compress data/realworld/grch38.fna /tmp/g.apex -mt && \
./apex decompress /tmp/g.apex /tmp/g_out && \
cmp data/realworld/grch38.fna /tmp/g_out && echo "PASS" && \
rm -f /tmp/g.apex /tmp/g_out
```

To round-trip a file of your own:

```bash
./apex compress myfile.tar test.apex -mt
./apex decompress test.apex test_out.tar
cmp myfile.tar test_out.tar && echo "ROUND-TRIP: PASS"
md5sum myfile.tar test_out.tar   # Should show identical hashes
rm -f test.apex test_out.tar
```

The included `verify.sh` runs 14 independent checks using only standard Unix tools (`stat`, `md5sum`, `sha256sum`, `cmp`, `date`, `bc`). It does NOT use APEX's self-reported numbers.
What it tests:
- 1T compress works
- Decompress produces correct output
- Lossless: size match + MD5 + SHA256 + byte-level `cmp` (4 checks)
- Parallel mode round-trip (compress + decompress + verify)
- Custom configs: `--par 8`, `--par 20`, `--no-lzp` all round-trip correctly
- Cross-mode: 1T and Par 14MB both decompress
- Determinism: compressing twice → identical output
```bash
# Basic usage
./verify.sh data/silesia.tar

# Specify which binary to test
./verify.sh data/silesia.tar ./apex-gpu-avx512
./verify.sh data/silesia.tar ./apex-cpu-avx2

# Test on multiple datasets
./verify.sh data/enwik9
./verify.sh data/realworld/large_json_1gb.json
./verify.sh data/realworld/linux-kernel.tar
```

Or do it manually without any script:
```bash
# Compress and check the compressed file size yourself
./apex compress data/silesia.tar /tmp/test.apex -mt
ls -la data/silesia.tar /tmp/test.apex
# Calculate ratio: 211957760 / compressed_size

# Decompress and check MD5 yourself
./apex decompress /tmp/test.apex /tmp/test_out
md5sum data/silesia.tar /tmp/test_out
# Both hashes MUST be identical

# Time it yourself
time ./apex compress data/silesia.tar /tmp/test.apex -mt
# speed = 202 MB / real_seconds

rm -f /tmp/test.apex /tmp/test_out
```

`apex bench` measures algorithm speed (data pre-loaded in RAM, excluding file I/O) — the same methodology used by lzbench, Squash, and all standard compression benchmarks. Wall-clock time includes disk read + write, which is slower. Both are valid measurements; they answer different questions.
Laptops throttle GPU/CPU under sustained load. 10 seconds between datasets prevents thermal throttling from affecting subsequent results. Desktops and servers need less cooldown.
APEX auto-detects your physical CPU cores and uses cores - 2 threads (reserves 2 for OS/IO).
```bash
# Check what APEX detected
./apex --help | grep Workers

# Check your actual cores
nproc                               # Total logical threads (includes SMT/HT)
lscpu | grep "Core(s) per socket"   # Physical cores per socket
lscpu | grep "Socket(s)"            # Number of sockets (usually 1)
```

Override:

```bash
./apex compress data.tar out.apex -mt -t 8   # Force 8 threads
./apex tune mydata.tar -t 4                  # Tune with 4 threads
```

| Your CPU | Cores | Auto Workers | Override? |
|---|---|---|---|
| Intel i5 / Ryzen 5 (4-6 core) | 4-6 | 2-4 | Usually fine |
| Intel i7 / Ryzen 7 (8 core) | 8 | 6 | Usually fine |
| Intel i9 / Ryzen 9 (16 core) | 16 | 14 | Usually fine |
| Threadripper (32-64 core) | 32-64 | 28-56 | Consider -t 30 (GPU bottleneck) |
| EPYC / Xeon (64-128 core) | 64-128 | 56+ | Use -t 30 (diminishing returns) |
| Laptop on battery | Any | Auto | Consider -t 4 to save power |
| Shared server | Any | Auto | Use -t <your_fair_share> |
Intel hybrid CPUs (12th-14th gen with P+E cores): APEX uses all detected cores. E-cores are slower for compression. For best results, use -t <P-core count - 2>.
APEX uses ~5.2 GB VRAM by default (128MB transform blocks × 2 GPU contexts × 20.5x working memory).
| Your VRAM | Will It Work? | Notes |
|---|---|---|
| 4 GB | Marginal | May fail on large files. Use --par 6. |
| 6 GB | Yes | Occasional pressure on large blocks. |
| 8 GB | Yes (default) | Designed for this. |
| 12-16 GB | Yes | Extra headroom, no benefit from defaults. |
| 24+ GB | Yes | No additional benefit (blocks capped at 128MB). |
If GPU fails, APEX automatically falls back to CPU transform. Same ratio, slower speed.
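The ~5.2 GB default quoted above follows directly from the stated factors; a quick arithmetic check (taking 1 GB = 1000 MB, which matches the quoted figure):

```python
# Default VRAM footprint: 128 MB transform blocks × 2 GPU contexts
# × 20.5x working memory, per the figures stated above.
block_mb, contexts, working_factor = 128, 2, 20.5
vram_gb = block_mb * contexts * working_factor / 1000
print(f"~{vram_gb:.1f} GB")   # ~5.2 GB
```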
| File Size | RAM Needed (bench) | RAM Needed (compress) |
|---|---|---|
| 100 MB | ~400 MB | ~200 MB |
| 500 MB | ~2 GB | ~1 GB |
| 1 GB | ~4 GB | ~2 GB |
| 3 GB | ~10 GB | ~5 GB |
The bench command needs ~3x file size (input + compressed + decompressed in RAM simultaneously). If OOM, use compress + decompress separately.
Any SSD is fine. APEX peaks at ~1.7 GB/s throughput (Large JSON) — NVMe Gen3+ handles this easily. HDDs may bottleneck on large files.
Block size is the most important tuning parameter. It controls the trade-off between speed and ratio.
| Block Size | Speed | Ratio | Best For |
|---|---|---|---|
| `--par 6` | Fastest | Lowest par ratio | Small/mixed files (<200MB), max throughput |
| `--par 8` | Fast | Good | Text files, Wikipedia, books |
| `--par 12` | Balanced | Better | Medium source code, tarballs |
| `--par 14` | Good | High | Large source code (LLVM, Chromium) |
| `--par 18` | Good | Higher | Repetitive data (JSON, CSV, logs) |
| `--par 20` | Moderate | Best par | Best ratio in parallel mode |
| (no flags) | Slowest | Best overall | 1T mode, archival, maximum ratio |
Why? Larger blocks give transform more context to find patterns. A 20MB block sees patterns spanning 20MB. A 6MB block only sees 6MB. But larger blocks mean fewer blocks, so less pipeline overlap with the GPU.
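The block-count side of that trade-off is simple arithmetic (our illustration, not APEX internals):

```python
# Larger blocks -> more context per block, but fewer blocks available
# to overlap CPU and GPU work in the pipeline.
import math

def block_count(file_mb: float, block_mb: int) -> int:
    return math.ceil(file_mb / block_mb)

for b in (6, 20):
    print(f"--par {b}: {block_count(202, b)} blocks")  # Silesia, 202 MB
```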
Don't guess — measure:

```bash
./apex tune mydata.tar   # Tests ALL sizes, recommends the best
```

To skip preprocessing:

```bash
./apex compress data.tar data.apex -mt --no-lzp
```

The preprocessing pass scans for repeated 40+ byte sequences. It costs roughly 400 MB/s of compression throughput. On data with few long repeats (binary, random-looking), it's pure overhead. `--no-lzp` skips it.
- Text/source/JSON: Keep preprocessing on (ratio gain is worth it)
- Binary/mixed: Try both with
tune, preprocessing might not help - Speed-critical:
--no-lzpfor +65% compress speed
```bash
# Maximum speed: parallel + skip preprocessing + small blocks
./apex compress data.tar data.apex --par 6 --no-lzp

# Maximum ratio: 1T mode (default, no flags needed)
./apex compress data.tar data.apex

# Balanced: parallel with medium blocks
./apex compress data.tar data.apex --par 14

# Controlled: specific threads + blocks
./apex compress data.tar data.apex --par 16 -t 8
```

Our test system: AMD Ryzen 9 8940HX (16C/32T) + NVIDIA RTX 5070 Laptop (8GB) + 16GB DDR5
| Dataset | Size | Compress | Decompress | Ratio | Config |
|---|---|---|---|---|---|
| Silesia (mixed) | 202 MB | 541 MB/s | 672 MB/s | 4.00x | Par 6MB |
| Spark Logs | 2.8 GB | 1,257 MB/s | 1,545 MB/s | 28.35x | Par 16MB |
| Large JSON | 1.1 GB | 1,642 MB/s | 2,022 MB/s | 18.11x | Par 18MB |
| HDFS Logs | 1.5 GB | 994 MB/s | 1,330 MB/s | 16.36x | Par 12MB |
| BGL Logs | 709 MB | 767 MB/s | 1,102 MB/s | 17.32x | Par 12MB |
| Linux Kernel | 1.5 GB | 817 MB/s | 999 MB/s | 9.26x | Par 12MB |
| Binance BNB | 612 MB | 531 MB/s | 682 MB/s | 7.27x | Par 6MB |
| IMDb TSV | 2.6 GB | 583 MB/s | 719 MB/s | 5.36x | Par 6MB |
| enwik9 (text) | 954 MB | 634 MB/s | 697 MB/s | 4.36x | Par 8MB |
| Human Genome | 3.0 GB | 479 MB/s | 757 MB/s | 4.36x | Par 8MB |
| Dataset | Ratio | Compress | Decompress |
|---|---|---|---|
| Spark Logs | 29.16x | 417 MB/s | 1,780 MB/s |
| Large JSON | 23.11x | 540 MB/s | 1,965 MB/s |
| HDFS Logs | 17.79x | 376 MB/s | 1,357 MB/s |
| BGL Logs | 17.03x | 324 MB/s | 1,033 MB/s |
| Linux Kernel | 9.64x | 329 MB/s | 1,201 MB/s |
| Binance BNB | 7.10x | 247 MB/s | 654 MB/s |
| IMDb TSV | 5.53x | 249 MB/s | 860 MB/s |
| enwik9 | 5.04x | 241 MB/s | 642 MB/s |
| Human Genome | 4.48x | 213 MB/s | 828 MB/s |
| Silesia | 4.02x | 226 MB/s | 578 MB/s |
Your numbers will differ based on your GPU, CPU, and RAM. Run ./apex bench and ./apex tune to measure YOUR system.
| Problem | Cause | Fix |
|---|---|---|
| `GPU: Disabled (CPU-only mode)` | CUDA not found | Install CUDA toolkit, ensure `nvcc` is in PATH |
| First run is slow | CUDA driver loading (one-time per process) | Normal. Use `apex bench` for accurate speed. CPU-only binary has no delay. |
| `Killed` or no output after starting | Not enough RAM (see below) | Use compress+decompress separately |
| GPU memory errors | VRAM < 8 GB | Use `--par 6` for smaller GPU transform blocks |
| `GLIBC_2.38 not found` | Old Linux | Need Ubuntu 24.04+ or Fedora 39+, or glibc 2.38+ |
| Speeds lower than reference | Different hardware | Expected. Run `./apex tune` for YOUR optimal config. |
| Speed drops during long benchmarks | Thermal throttling | Add `sleep 10` between datasets. Desktop/server won't have this. |
| `command not found` | Not executable | `chmod +x apex` |
| `No such file or directory` for datasets | Not downloaded | Run `./download_datasets.sh` first |
This binary reads the entire input file into RAM before processing. The compression algorithm itself is block-based (6-20 MB blocks) and needs only a few GB of working memory, but the CLI loads the full file upfront. Here's what that means in practice:
| Command | RAM needed | Why |
|---|---|---|
| `bench` | ~3x file size | Holds original + compressed + decompressed simultaneously |
| `compress` | ~1.5x file size | Holds original + compressed output |
| `decompress` | ~1.5x file size | Holds compressed + decompressed output |

| File Size | `bench` needs | `compress`/`decompress` needs |
|---|---|---|
| 200 MB | ~600 MB | ~300 MB |
| 1 GB | ~3 GB | ~1.5 GB |
| 3 GB | ~9 GB | ~4.5 GB |
| 5 GB+ | ~15 GB+ | ~7.5 GB+ |
For the 5 essential datasets (up to 3 GB), 16 GB RAM is sufficient for all commands. For enterprise datasets over 4 GB, use `compress`/`decompress` separately instead of `bench`, and use `--par 6` (smallest blocks = lowest memory overhead).
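The rule of thumb above is easy to script before kicking off a long run. A sketch, assuming Linux with GNU `stat` — the `apex_ram_estimate` helper is illustrative, not an APEX command:

```shell
# Hypothetical helper: estimate RAM needed for each APEX command on a file,
# using the multipliers from the table above (~3x bench, ~1.5x compress/decompress).
apex_ram_estimate() {
  size_mb=$(( ($(stat -c%s "$1") + 1048575) / 1048576 ))   # size, rounded up to MB
  echo "file: ${size_mb} MB"
  echo "bench needs ~$(( size_mb * 3 )) MB RAM"
  echo "compress/decompress need ~$(( size_mb * 3 / 2 )) MB RAM"
}
```

Compare the estimate against `free -h` before choosing between `bench` and a separate `compress`/`decompress` run.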
How to tell if it OOM'd:

```shell
# Signs of OOM:
# - "Killed" message
# - Process exits with no output
# - dmesg shows "Out of memory: Killed process"
dmesg | tail -5
```

Fix — use `compress` + `decompress` separately:

```shell
./apex compress data/realworld/grch38.fna /tmp/test.apex --par 6
./apex decompress /tmp/test.apex /tmp/test_out
cmp data/realworld/grch38.fna /tmp/test_out && echo "PASS"
rm -f /tmp/test.apex /tmp/test_out

# To measure speed externally:
time ./apex compress data/realworld/grch38.fna /tmp/test.apex --par 6
# speed ≈ file_size_MB / real_seconds
```

Full system check:

```shell
./apex --help    # APEX version, GPU status, workers
nvidia-smi       # GPU model, VRAM, driver version
nvcc --version   # CUDA toolkit version
nproc            # CPU thread count
free -h          # Available RAM
lscpu | grep -E "Model name|Core|Socket|Thread"   # CPU details
```

When sharing benchmark results, please include this information so others can compare:
```
=== Hardware ===
CPU: [model] ([cores]C/[threads]T)
GPU: [model] ([VRAM] GB)
RAM: [total] GB [DDR4/DDR5] [single/dual channel]
OS: [distro] [version], Kernel [version]
CUDA: [version]

=== APEX Info ===
[paste output of: ./apex --help | head -4]

=== Results ===
[paste output of: ./apex bench data/silesia.tar]

=== Round-trip Verification ===
[paste output of: ./apex compress data/silesia.tar /tmp/t.apex -mt && \
                  ./apex decompress /tmp/t.apex /tmp/t_out && \
                  md5sum data/silesia.tar /tmp/t_out]
```
Example:

```
=== Hardware ===
CPU: AMD Ryzen 7 7700X (8C/16T)
GPU: NVIDIA RTX 4070 (12 GB)
RAM: 32 GB DDR5-6000 dual channel
OS: Ubuntu 24.04, Kernel 6.8.0
CUDA: 13.2.51

=== APEX Info ===
APEX 0.1.0 — GPU-Accelerated Lossless Compression
SIMD: AVX-512 (Tier 1)
GPU: CUDA enabled
Workers: 6 threads

=== Results ===
Config     Compress    Decomp      Ratio   Verify
1T         XXX MB/s    XXX MB/s    4.02x   PASS
Par 6MB    XXX MB/s    XXX MB/s    4.00x   PASS
...
```
APEX uses libcubwt and libsais — both by Ilya Grebnov. These are exceptional libraries that make high-performance BWT practical.
We also benchmark against libbsc (also by Grebnov) — a BWT compressor using the same underlying libraries. bsc achieves 5-14% better ratio; APEX is 7-18x faster. Full comparison in BENCHMARKS.md.
- APEX is under active development (v0.1.0). While all 21 benchmark datasets pass round-trip verification, there may be edge cases or configurations we haven't encountered yet. If you find an issue, we'd appreciate hearing about it.
- There is one known issue: LLVM 2.4GB fails in 1T mode due to a block boundary bug at ~384MB. All parallel modes work correctly on this file.
- Performance varies by hardware. Our reference numbers are from a specific test system (Ryzen 9 8940HX + RTX 5070 Laptop). Your results will differ based on GPU, CPU, and thermal conditions.
- The binary auto-detects hardware capabilities. If a GPU is not available or CUDA is not installed, it falls back to CPU-only mode automatically and prints `GPU: Disabled (CPU-only mode)`.
- Compression is lossless — decompressed output is byte-identical to the original. Every `bench` and `tune` run verifies this automatically (PASS/FAIL).
For questions, collaboration, or licensing inquiries: ritik135001@gmail.com
APEX Testing License v1.0 — Copyright 2026 Ritik. All rights reserved.
This binary is provided exclusively for testing, benchmarking, and evaluation. No ownership or IP rights are transferred. See LICENSE for full terms.
You MAY: download, run, benchmark, share results with attribution, distribute this package unchanged.
You MAY NOT: reverse-engineer, decompile, use in production or commercial services, resell, wrap in another product/service/API, repackage under a different name, claim credit, or create derivative works.
All rights not explicitly granted are reserved by the author.