Near-Optimal Lossless Data Compression Using Arithmetic Coding in C++
This project implements a complete lossless data compression system based on arithmetic coding, designed to achieve compression performance close to the theoretical entropy limit. The implementation is written entirely in C++ and focuses on correctness, numerical stability, modular design, and practical system concerns such as streaming and memory usage.
The compressor supports static and adaptive probability models, uses integer-only interval arithmetic with proper renormalization (E1/E2/E3), and provides a streaming mode that allows large files to be compressed and decompressed without loading them fully into memory.
This project was developed as a senior-level academic project and is suitable both as a reference implementation of arithmetic coding and as a foundation for further experimentation in data compression. Read the full documentation in my report: https://docs.google.com/document/d/1F0SG7RKm7Coia5g5S69HAl8L-Gw6oZnUplmYcuqcvFo/edit?usp=sharing
Key features:
- Lossless arithmetic coding with bit-exact decoding
- Static probability model based on precomputed symbol frequencies
- Adaptive probability model with Laplace smoothing and rescaling
- Streaming compression and decompression for large files
- Integer-only arithmetic (no floating point) for deterministic behavior
- E1 / E2 / E3 renormalization with correct underflow handling
- Custom binary file format (AC03) with self-contained metadata
- Automated correctness testing using deterministic round-trip validation
- Evaluation tools for entropy, Huffman coding, runtime, and memory usage
Static model:
- Symbol frequencies are computed once from the entire input file
- Frequency table is stored in the compressed file header
- Faster encoding and decoding for stationary data distributions

Adaptive model:
- Symbol frequencies are updated dynamically during encoding and decoding
- Uses Laplace smoothing to avoid zero probabilities
- No frequency table stored in the compressed file
- Better suited for non-stationary or unknown data distributions

Streaming mode:
- Processes input data incrementally in fixed-size blocks
- Writes compressed bits directly to disk
- Minimizes peak memory usage
- Ideal for large files and data streams
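The adaptive model described above can be sketched as follows. This is an illustrative stand-in, not the project's AdaptiveModel class: the increment step, rescale threshold, and member names are assumptions. It shows the two ideas the bullets name: Laplace smoothing (every symbol starts with count 1, so no probability is ever zero) and periodic rescaling (counts are halved once the total crosses a cap, which bounds the totals and lets old statistics decay).

```cpp
#include <array>
#include <cstdint>

// Sketch of an adaptive byte model with Laplace smoothing and rescaling.
// Illustrative only; the project's AdaptiveModel may differ in detail.
struct ByteModel {
    std::array<uint32_t, 256> freq;
    uint32_t total;
    static constexpr uint32_t kMaxTotal = 1u << 16;  // assumed rescale cap

    ByteModel() : total(256) {
        freq.fill(1);  // Laplace smoothing: no symbol has zero probability
    }

    // Called after each encoded/decoded symbol. Both sides perform the
    // identical update, which keeps encoder and decoder synchronized.
    void update(uint8_t sym) {
        freq[sym] += 32;  // assumed increment step
        total += 32;
        if (total >= kMaxTotal) rescale();
    }

    // Halve all counts (keeping each >= 1) so totals stay bounded and
    // older statistics decay in favor of recent data.
    void rescale() {
        total = 0;
        for (auto& f : freq) {
            f = (f + 1) / 2;
            total += f;
        }
    }
};
```

Because the decoder runs the same `update` after each decoded symbol, no frequency table needs to be stored in the file, which is why the adaptive format omits it.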
Compressed files use a custom binary format identified by the magic header AC03.
Header fields:
- Magic identifier (AC03)
- Compression mode (static or adaptive)
- Original uncompressed size
- Symbol frequency table (static mode only)
- Total number of valid compressed bits
- Payload size
- Bit-packed compressed payload
This format allows independent decoding without external metadata.
Arithmetic coding engine:
- Coding interval mapped to a 32-bit integer range [0, 2^32 - 1]
- Interval refinement based on cumulative frequencies
- Renormalization using standard arithmetic coding rules:
  - E1: lower half
  - E2: upper half
  - E3: underflow region
- Deferred bit handling guarantees correctness and prevents interval collapse
- Encoder and decoder remain perfectly synchronized
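The E1/E2/E3 rules can be sketched as the following renormalization loop. This is a generic textbook formulation, not the project's ArithmeticEncoder; the constants and member names are assumptions, and bits are emitted into a string for clarity rather than through BitIO.

```cpp
#include <cstdint>
#include <string>

// Sketch of E1/E2/E3 renormalization for a 32-bit arithmetic encoder.
// Illustrative only; the project's encoder writes bits through BitIO.
struct Renormalizer {
    static constexpr uint32_t kTop  = 0xFFFFFFFFu;  // 2^32 - 1
    static constexpr uint32_t kHalf = 0x80000000u;  // interval midpoint
    static constexpr uint32_t kQ1   = 0x40000000u;  // first quartile
    static constexpr uint32_t kQ3   = 0xC0000000u;  // third quartile

    uint32_t low = 0, high = kTop;
    uint32_t pending = 0;   // deferred (underflow) bits accumulated by E3
    std::string bits;       // emitted bits, most significant first

    void emit(char b) {
        bits.push_back(b);
        // Each E3 expansion deferred one bit of the opposite value.
        char opp = (b == '0') ? '1' : '0';
        for (; pending > 0; --pending) bits.push_back(opp);
    }

    // Renormalize after the interval [low, high] has been refined.
    void renormalize() {
        for (;;) {
            if (high < kHalf) {                       // E1: lower half
                emit('0');
            } else if (low >= kHalf) {                // E2: upper half
                emit('1');
                low -= kHalf; high -= kHalf;
            } else if (low >= kQ1 && high < kQ3) {    // E3: underflow region
                ++pending;                            // defer the bit decision
                low -= kQ1; high -= kQ1;
            } else {
                break;                                // interval is wide enough
            }
            low <<= 1;
            high = (high << 1) | 1;  // doubling keeps the interval maximal
        }
    }
};
```

The `pending` counter is the deferred-bit mechanism: E3 expands an interval that straddles the midpoint without deciding the output bit, and the decision is resolved (as the opposite of the next emitted bit) only when an E1 or E2 step finally fires. This is what prevents the interval from collapsing to zero width.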
The system is organized into modular components:
- SymbolModel / CumFreq – static frequency computation and cumulative tables
- AdaptiveModel – dynamic frequency updates using a Fenwick tree
- ArithmeticEncoder / ArithmeticDecoder – core coding logic
- BitIO / BitIOStream – bit-level input/output (memory and streaming)
- Evaluation modules – entropy, Huffman coding, runtime, and memory analysis
This structure allows easy extension and experimentation with alternative models or I/O strategies.
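The Fenwick tree mentioned for AdaptiveModel is what makes dynamic frequency updates cheap: both updating one symbol's count and querying a cumulative frequency cost O(log n). A generic sketch, with names and the 256-symbol setup assumed rather than taken from the project:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a Fenwick (binary indexed) tree over symbol frequencies, the
// structure an adaptive model can use so that updating a count and querying
// a cumulative frequency both take O(log n).
class Fenwick {
    std::vector<uint32_t> tree;  // 1-based internal array
public:
    explicit Fenwick(std::size_t n) : tree(n + 1, 0) {}

    // Add delta to the frequency of symbol i (0-based).
    void add(std::size_t i, uint32_t delta) {
        for (std::size_t j = i + 1; j < tree.size(); j += j & (~j + 1))
            tree[j] += delta;
    }

    // Sum of frequencies of symbols 0..i-1: the cumulative frequency that
    // selects the coding subinterval during encoding and decoding.
    uint32_t cumfreq(std::size_t i) const {
        uint32_t s = 0;
        for (std::size_t j = i; j > 0; j -= j & (~j + 1))
            s += tree[j];
        return s;
    }
};
```

A plain array of counts would make cumulative queries O(n) per symbol, which dominates runtime when the model updates after every byte; the Fenwick tree removes that bottleneck.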
Correctness is verified through automated round-trip testing:
- Input data is compressed and then decompressed
- Output is compared byte-for-byte with the original input
- Tests cover:
- Empty files
- Single-byte inputs
- Low-entropy data
- High-entropy random data (with fixed seeds)
- Large files in streaming mode
All tests are executed using CTest and fail immediately on any mismatch.
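The round-trip pattern above can be sketched as follows. The `compress`/`decompress` pair here is a stand-in parameter (the real tests call the project's encoder and decoder); what the sketch shows is the two properties the tests rely on: fixed-seed input generation for reproducibility, and a byte-for-byte equality check.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch of the deterministic round-trip test pattern. The compress and
// decompress callables are placeholders for the project's encoder/decoder.
std::vector<uint8_t> make_input(std::size_t n, uint32_t seed) {
    std::mt19937 rng(seed);  // fixed seed => reproducible "random" data
    std::uniform_int_distribution<int> byte(0, 255);
    std::vector<uint8_t> data(n);
    for (auto& b : data) b = static_cast<uint8_t>(byte(rng));
    return data;
}

template <typename Compress, typename Decompress>
bool round_trip(const std::vector<uint8_t>& input,
                Compress compress, Decompress decompress) {
    // Byte-for-byte comparison: any mismatch fails the test immediately.
    return decompress(compress(input)) == input;
}
```

Using a seeded std::mt19937 rather than an unseeded source is what makes the "high-entropy random data" cases deterministic, so a CTest failure is always reproducible.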
The project evaluates compression performance against:
- Shannon entropy (theoretical lower bound)
- Huffman coding (prefix-based baseline)
Metrics include:
- Average bits per symbol
- Total compressed size
- Compression ratio
- Encoding and decoding runtime
- Memory usage and peak RSS
Results show that arithmetic coding consistently achieves compression close to entropy and often outperforms Huffman coding, especially for larger inputs.
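The entropy baseline used in the comparison is the standard Shannon formula H = -sum p_i * log2(p_i), in bits per symbol. A self-contained sketch (not the project's evaluation module) for byte-oriented data:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes the Shannon entropy of a byte sequence in bits per symbol.
// This is the theoretical lower bound that compressed output is measured
// against; no lossless code can average fewer bits per symbol than this.
double shannon_entropy(const std::vector<uint8_t>& data) {
    if (data.empty()) return 0.0;
    std::vector<std::size_t> count(256, 0);
    for (uint8_t b : data) ++count[b];
    double h = 0.0;
    const double n = static_cast<double>(data.size());
    for (std::size_t c : count) {
        if (c == 0) continue;  // the 0 * log 0 term is taken as 0
        const double p = static_cast<double>(c) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

Arithmetic coding can approach this bound because it assigns fractional bit costs to symbols, whereas Huffman coding must round every codeword up to a whole number of bits, which is the main source of the gap the evaluation measures.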
Requirements:
- C++17-compatible compiler (g++ or clang++)
- CMake 3.16 or later
- Unix-like environment (Linux or macOS)
The project provides a single executable, ac, which exposes multiple subcommands through a command-line interface.
In local builds, the executable is typically run as:
./build/ac
ac encode <input_file> <output_file.ac>
ac decode <input_file.ac> <output_file>
ac encode-adaptive <input_file> <output_file.ac>
ac decode-adaptive <input_file.ac> <output_file>
ac encode-adaptive-stream <input_file> <output_file.ac>
ac decode-stream <input_file.ac> <output_file>
ac eval <input_file> <output_report.txt>
ac eval-adaptive <input_file> <output_report.txt>
ac eval-stream <input_file> <output_report.txt>
ac --help
Subcommands:
- encode / decode: Performs static arithmetic coding using a precomputed frequency table stored in the compressed file.
- encode-adaptive / decode-adaptive: Performs adaptive arithmetic coding, updating symbol frequencies dynamically during processing.
- encode-adaptive-stream / decode-stream: Enables streaming adaptive compression and decompression, processing data incrementally to minimize memory usage.
- eval: Computes entropy, Huffman coding statistics, arithmetic coding efficiency, runtime, and compression ratio for static mode.
- eval-adaptive: Evaluates compression efficiency and runtime for adaptive arithmetic coding.
- eval-stream: Evaluates streaming adaptive compression, including memory usage characteristics.
# Clone repository
git clone https://github.com/MonicaMell/ArithmeticCompressor.git
cd ArithmeticCompressor
# Build
mkdir build && cd build
cmake ..
make

After building, run the executable using:

./build/ac <command> [arguments]

Project goals:
- Demonstrate a correct and near-optimal implementation of arithmetic coding
- Translate information theory into a practical software system
- Address real-world concerns such as numerical precision, memory usage, and scalability
- Provide a clean, extensible codebase suitable for academic defense and future work
Possible extensions include:
- Larger or non-byte alphabets
- Context-based or higher-order probability models
- Integration with other compression pipelines
- Comparison with modern alternatives such as ANS
Monika Melkonyan, fourth-year Computer Science student, French University in Armenia.
This project is intended for academic and educational use.