Near-Optimal Lossless Data Compression Using Arithmetic Coding in C++
This project implements a complete lossless data compression system based on arithmetic coding, designed to achieve compression performance close to the theoretical entropy limit. The implementation is written entirely in C++ and focuses on correctness, numerical stability, modular design, and practical system concerns such as streaming and memory usage.
The compressor supports static and adaptive probability models, uses integer-only interval arithmetic with proper renormalization (E1/E2/E3), and provides a streaming mode that allows large files to be compressed and decompressed without loading them fully into memory.
This project was developed as a senior-level academic project and is suitable both as a reference implementation of arithmetic coding and as a foundation for further experimentation in data compression. Read the full documentation in my report: https://docs.google.com/document/d/1F0SG7RKm7Coia5g5S69HAl8L-Gw6oZnUplmYcuqcvFo/edit?usp=sharing
Key features:
- Lossless arithmetic coding with bit-exact decoding
- Static probability model based on precomputed symbol frequencies
- Adaptive probability model with Laplace smoothing and rescaling
- Streaming compression and decompression for large files
- Integer-only arithmetic (no floating point) for deterministic behavior
- E1 / E2 / E3 renormalization with correct underflow handling
- Custom binary file format (AC03) with self-contained metadata
- Automated correctness testing using deterministic round-trip validation
- Evaluation tools for entropy, Huffman coding, runtime, and memory usage
Static model:
- Symbol frequencies are computed once from the entire input file
- Frequency table is stored in the compressed file header
- Faster encoding and decoding for stationary data distributions

Adaptive model:
- Symbol frequencies are updated dynamically during encoding and decoding
- Uses Laplace smoothing to avoid zero probabilities
- No frequency table stored in the compressed file
- Better suited for non-stationary or unknown data distributions

Streaming mode:
- Processes input data incrementally in fixed-size blocks
- Writes compressed bits directly to disk
- Minimizes peak memory usage
- Ideal for large files and data streams
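The adaptive model described above can be sketched as follows. This is an illustrative stand-in, not the project's AdaptiveModel class: the increment step, rescale threshold, and member names are assumptions. It shows the two ideas the bullets name: Laplace smoothing (every symbol starts with count 1, so no probability is ever zero) and periodic rescaling (counts are halved once the total crosses a cap, which bounds the totals and lets old statistics decay).

```cpp
#include <array>
#include <cstdint>

// Sketch of an adaptive byte model with Laplace smoothing and rescaling.
// Illustrative only; the project's AdaptiveModel may differ in detail.
struct ByteModel {
    std::array<uint32_t, 256> freq;
    uint32_t total;
    static constexpr uint32_t kMaxTotal = 1u << 16;  // assumed rescale cap

    ByteModel() : total(256) {
        freq.fill(1);  // Laplace smoothing: no symbol has zero probability
    }

    // Called after each encoded/decoded symbol. Both sides perform the
    // identical update, which keeps encoder and decoder synchronized.
    void update(uint8_t sym) {
        freq[sym] += 32;  // assumed increment step
        total += 32;
        if (total >= kMaxTotal) rescale();
    }

    // Halve all counts (keeping each >= 1) so totals stay bounded and
    // older statistics decay in favor of recent data.
    void rescale() {
        total = 0;
        for (auto& f : freq) {
            f = (f + 1) / 2;
            total += f;
        }
    }
};
```

Because the decoder runs the same `update` after each decoded symbol, no frequency table needs to be stored in the file, which is why the adaptive format omits it.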
Compressed files use a custom binary format identified by the magic header AC03.
Header fields:
- Magic identifier (AC03)
- Compression mode (static or adaptive)
- Original uncompressed size
- Symbol frequency table (static mode only)
- Total number of valid compressed bits
- Payload size
- Bit-packed compressed payload
This format allows independent decoding without external metadata.
Arithmetic coding engine:
- Coding interval mapped to a 32-bit integer range [0, 2^32 - 1]
- Interval refinement based on cumulative frequencies
- Renormalization using standard arithmetic coding rules:
  - E1: lower half
  - E2: upper half
  - E3: underflow region
- Deferred bit handling guarantees correctness and prevents interval collapse
- Encoder and decoder remain perfectly synchronized
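The E1/E2/E3 rules can be sketched as the following renormalization loop. This is a generic textbook formulation, not the project's ArithmeticEncoder; the constants and member names are assumptions, and bits are emitted into a string for clarity rather than through BitIO.

```cpp
#include <cstdint>
#include <string>

// Sketch of E1/E2/E3 renormalization for a 32-bit arithmetic encoder.
// Illustrative only; the project's encoder writes bits through BitIO.
struct Renormalizer {
    static constexpr uint32_t kTop  = 0xFFFFFFFFu;  // 2^32 - 1
    static constexpr uint32_t kHalf = 0x80000000u;  // interval midpoint
    static constexpr uint32_t kQ1   = 0x40000000u;  // first quartile
    static constexpr uint32_t kQ3   = 0xC0000000u;  // third quartile

    uint32_t low = 0, high = kTop;
    uint32_t pending = 0;   // deferred (underflow) bits accumulated by E3
    std::string bits;       // emitted bits, most significant first

    void emit(char b) {
        bits.push_back(b);
        // Each E3 expansion deferred one bit of the opposite value.
        char opp = (b == '0') ? '1' : '0';
        for (; pending > 0; --pending) bits.push_back(opp);
    }

    // Renormalize after the interval [low, high] has been refined.
    void renormalize() {
        for (;;) {
            if (high < kHalf) {                       // E1: lower half
                emit('0');
            } else if (low >= kHalf) {                // E2: upper half
                emit('1');
                low -= kHalf; high -= kHalf;
            } else if (low >= kQ1 && high < kQ3) {    // E3: underflow region
                ++pending;                            // defer the bit decision
                low -= kQ1; high -= kQ1;
            } else {
                break;                                // interval is wide enough
            }
            low <<= 1;
            high = (high << 1) | 1;  // doubling keeps the interval maximal
        }
    }
};
```

The `pending` counter is the deferred-bit mechanism: E3 expands an interval that straddles the midpoint without deciding the output bit, and the decision is resolved (as the opposite of the next emitted bit) only when an E1 or E2 step finally fires. This is what prevents the interval from collapsing to zero width.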
The system is organized into modular components:
- SymbolModel / CumFreq – static frequency computation and cumulative tables
- AdaptiveModel – dynamic frequency updates using a Fenwick tree
- ArithmeticEncoder / ArithmeticDecoder – core coding logic
- BitIO / BitIOStream – bit-level input/output (memory and streaming)
- Evaluation modules – entropy, Huffman coding, runtime, and memory analysis
This structure allows easy extension and experimentation with alternative models or I/O strategies.
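The Fenwick tree mentioned for AdaptiveModel is what makes dynamic frequency updates cheap: both updating one symbol's count and querying a cumulative frequency cost O(log n). A generic sketch, with names and the 256-symbol setup assumed rather than taken from the project:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a Fenwick (binary indexed) tree over symbol frequencies, the
// structure an adaptive model can use so that updating a count and querying
// a cumulative frequency both take O(log n).
class Fenwick {
    std::vector<uint32_t> tree;  // 1-based internal array
public:
    explicit Fenwick(std::size_t n) : tree(n + 1, 0) {}

    // Add delta to the frequency of symbol i (0-based).
    void add(std::size_t i, uint32_t delta) {
        for (std::size_t j = i + 1; j < tree.size(); j += j & (~j + 1))
            tree[j] += delta;
    }

    // Sum of frequencies of symbols 0..i-1: the cumulative frequency that
    // selects the coding subinterval during encoding and decoding.
    uint32_t cumfreq(std::size_t i) const {
        uint32_t s = 0;
        for (std::size_t j = i; j > 0; j -= j & (~j + 1))
            s += tree[j];
        return s;
    }
};
```

A plain array of counts would make cumulative queries O(n) per symbol, which dominates runtime when the model updates after every byte; the Fenwick tree removes that bottleneck.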
Correctness is verified through automated round-trip testing:
- Input data is compressed and then decompressed
- Output is compared byte-for-byte with the original input
- Tests cover:
- Empty files
- Single-byte inputs
- Low-entropy data
- High-entropy random data (with fixed seeds)
- Large files in streaming mode
All tests are executed using CTest and fail immediately on any mismatch.
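The round-trip pattern above can be sketched as follows. The `compress`/`decompress` pair here is a stand-in parameter (the real tests call the project's encoder and decoder); what the sketch shows is the two properties the tests rely on: fixed-seed input generation for reproducibility, and a byte-for-byte equality check.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch of the deterministic round-trip test pattern. The compress and
// decompress callables are placeholders for the project's encoder/decoder.
std::vector<uint8_t> make_input(std::size_t n, uint32_t seed) {
    std::mt19937 rng(seed);  // fixed seed => reproducible "random" data
    std::uniform_int_distribution<int> byte(0, 255);
    std::vector<uint8_t> data(n);
    for (auto& b : data) b = static_cast<uint8_t>(byte(rng));
    return data;
}

template <typename Compress, typename Decompress>
bool round_trip(const std::vector<uint8_t>& input,
                Compress compress, Decompress decompress) {
    // Byte-for-byte comparison: any mismatch fails the test immediately.
    return decompress(compress(input)) == input;
}
```

Using a seeded std::mt19937 rather than an unseeded source is what makes the "high-entropy random data" cases deterministic, so a CTest failure is always reproducible.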
The project evaluates compression performance against:
- Shannon entropy (theoretical lower bound)
- Huffman coding (prefix-based baseline)
Metrics include:
- Average bits per symbol
- Total compressed size
- Compression ratio
- Encoding and decoding runtime
- Memory usage and peak RSS
Results show that arithmetic coding consistently achieves compression close to entropy and often outperforms Huffman coding, especially for larger inputs.
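The entropy baseline used in the comparison is the standard Shannon formula H = -sum p_i * log2(p_i), in bits per symbol. A self-contained sketch (not the project's evaluation module) for byte-oriented data:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes the Shannon entropy of a byte sequence in bits per symbol.
// This is the theoretical lower bound that compressed output is measured
// against; no lossless code can average fewer bits per symbol than this.
double shannon_entropy(const std::vector<uint8_t>& data) {
    if (data.empty()) return 0.0;
    std::vector<std::size_t> count(256, 0);
    for (uint8_t b : data) ++count[b];
    double h = 0.0;
    const double n = static_cast<double>(data.size());
    for (std::size_t c : count) {
        if (c == 0) continue;  // the 0 * log 0 term is taken as 0
        const double p = static_cast<double>(c) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

Arithmetic coding can approach this bound because it assigns fractional bit costs to symbols, whereas Huffman coding must round every codeword up to a whole number of bits, which is the main source of the gap the evaluation measures.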
Requirements:
- C++17-compatible compiler (g++ or clang++)
- CMake 3.16 or later
- Unix-like environment (Linux or macOS)
The project provides a single executable, ac, which exposes multiple subcommands through a command-line interface.
In local builds, the executable is typically run as:
./build/ac
ac encode <input_file> <output_file.ac>
ac decode <input_file.ac> <output_file>
ac encode-adaptive <input_file> <output_file.ac>
ac decode-adaptive <input_file.ac> <output_file>
ac encode-adaptive-stream <input_file> <output_file.ac>
ac decode-stream <input_file.ac> <output_file>
ac eval <input_file> <output_report.txt>
ac eval-adaptive <input_file> <output_report.txt>
ac eval-stream <input_file> <output_report.txt>
ac --help
Subcommands:
- encode / decode: Performs static arithmetic coding using a precomputed frequency table stored in the compressed file.
- encode-adaptive / decode-adaptive: Performs adaptive arithmetic coding, updating symbol frequencies dynamically during processing.
- encode-adaptive-stream / decode-stream: Enables streaming adaptive compression and decompression, processing data incrementally to minimize memory usage.
- eval: Computes entropy, Huffman coding statistics, arithmetic coding efficiency, runtime, and compression ratio for static mode.
- eval-adaptive: Evaluates compression efficiency and runtime for adaptive arithmetic coding.
- eval-stream: Evaluates streaming adaptive compression, including memory usage characteristics.
# Clone repository
git clone https://github.com/MonicaMell/ArithmeticCompressor.git
cd ArithmeticCompressor
# Build
mkdir build && cd build
cmake ..
make

After building, run the executable using:

./build/ac <command> [arguments]

Project goals:
- Demonstrate a correct and near-optimal implementation of arithmetic coding
- Translate information theory into a practical software system
- Address real-world concerns such as numerical precision, memory usage, and scalability
- Provide a clean, extensible codebase suitable for academic defense and future work
Possible extensions include:
- Larger or non-byte alphabets
- Context-based or higher-order probability models
- Integration with other compression pipelines
- Comparison with modern alternatives such as ANS
Monika Melkonyan, fourth-year Computer Science student, French University in Armenia.
This project is intended for academic and educational use.