ArithmeticCompressor

Near-Optimal Lossless Data Compression Using Arithmetic Coding in C++

Overview

This project implements a complete lossless data compression system based on arithmetic coding, designed to achieve compression performance close to the theoretical entropy limit. The implementation is written entirely in C++ and focuses on correctness, numerical stability, modular design, and practical system concerns such as streaming and memory usage.

The compressor supports static and adaptive probability models, uses integer-only interval arithmetic with proper renormalization (E1/E2/E3), and provides a streaming mode that allows large files to be compressed and decompressed without loading them fully into memory.

This project was developed as a senior-level academic project and is suitable both as a reference implementation of arithmetic coding and as a foundation for further experimentation in data compression. Read the full documentation in my report: https://docs.google.com/document/d/1F0SG7RKm7Coia5g5S69HAl8L-Gw6oZnUplmYcuqcvFo/edit?usp=sharing


Key Features

  • Lossless arithmetic coding with bit-exact decoding
  • Static probability model based on precomputed symbol frequencies
  • Adaptive probability model with Laplace smoothing and rescaling
  • Streaming compression and decompression for large files
  • Integer-only arithmetic (no floating point) for deterministic behavior
  • E1 / E2 / E3 renormalization with correct underflow handling
  • Custom binary file format (AC03) with self-contained metadata
  • Automated correctness testing using deterministic round-trip validation
  • Evaluation tools for entropy, Huffman coding, runtime, and memory usage

Compression Modes

Static Arithmetic Coding

  • Symbol frequencies are computed once from the entire input file
  • Frequency table is stored in the compressed file header
  • Faster encoding and decoding for stationary data distributions
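As a sketch of this setup (illustrative only; the function and variable names here are mine, not the repository's), a single counting pass produces a cumulative table that maps each byte to a subinterval [cum[s], cum[s+1]) of the coder's range:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the static model setup: one pass counts byte
// frequencies, then a prefix sum derives cumulative totals. Symbol s is
// later coded into the subinterval [cum[s], cum[s+1]) of the range.
std::array<uint32_t, 257> buildCumulative(const std::vector<uint8_t>& data) {
    std::array<uint32_t, 256> freq{};   // per-symbol counts
    std::array<uint32_t, 257> cum{};    // cumulative totals, cum[0] = 0
    for (uint8_t b : data) ++freq[b];
    for (int s = 0; s < 256; ++s) cum[s + 1] = cum[s] + freq[s];
    return cum;
}
```

In static mode this table is computed once and serialized into the header, so the decoder can rebuild the identical model before reading any payload bits.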

Adaptive Arithmetic Coding

  • Symbol frequencies are updated dynamically during encoding and decoding
  • Uses Laplace smoothing to avoid zero probabilities
  • No frequency table stored in the compressed file
  • Better suited for non-stationary or unknown data distributions
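A minimal sketch of such a model (hypothetical names and a linear cumulative scan for clarity; the repository's AdaptiveModel uses a Fenwick tree instead):

```cpp
#include <array>
#include <cstdint>

// Sketch of an adaptive byte model with Laplace (add-one) smoothing:
// every symbol starts with count 1 so no probability is ever zero, and
// counts are halved when the total exceeds a cap so frequencies stay
// within the arithmetic coder's usable range.
class AdaptiveByteModel {
public:
    static constexpr uint32_t kMaxTotal = 1u << 16;  // assumed rescale cap

    AdaptiveByteModel() { freq_.fill(1); total_ = 256; }

    uint32_t freq(int sym) const { return freq_[sym]; }
    uint32_t total() const { return total_; }

    // Cumulative frequency of all symbols strictly below `sym`.
    uint32_t cumBelow(int sym) const {
        uint32_t c = 0;
        for (int s = 0; s < sym; ++s) c += freq_[s];
        return c;
    }

    // Called by encoder and decoder after each symbol, keeping both
    // sides' models in lockstep.
    void update(int sym) {
        freq_[sym] += 1;
        total_ += 1;
        if (total_ >= kMaxTotal) rescale();
    }

private:
    void rescale() {
        total_ = 0;
        for (auto& f : freq_) {
            f = (f + 1) / 2;   // halve, but never drop below 1
            total_ += f;
        }
    }

    std::array<uint32_t, 256> freq_;
    uint32_t total_;
};
```

Because the decoder applies the same update after decoding each symbol, encoder and decoder models evolve identically without any table in the file.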

Streaming Adaptive Mode

  • Processes input data incrementally in fixed-size blocks
  • Writes compressed bits directly to disk
  • Minimizes peak memory usage
  • Ideal for large files and data streams

File Format (AC03)

Compressed files use a custom binary format identified by the magic header AC03.

Header fields:

  • Magic identifier (AC03)
  • Compression mode (static or adaptive)
  • Original uncompressed size
  • Symbol frequency table (static mode only)
  • Total number of valid compressed bits
  • Payload size
  • Bit-packed compressed payload

This format allows independent decoding without external metadata.


Algorithm Highlights

  • Coding interval mapped to a 32-bit integer range [0, 2^32 - 1]

  • Interval refinement based on cumulative frequencies

  • Renormalization using standard arithmetic coding rules:

    • E1: lower half
    • E2: upper half
    • E3: underflow region
  • Deferred bit handling guarantees correctness and prevents interval collapse

  • Encoder and decoder remain perfectly synchronized
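The renormalization rules above can be sketched as a loop on a 32-bit interval (illustrative only; output bits are collected into a string here rather than a real BitIO sink):

```cpp
#include <cstdint>
#include <string>

// Sketch of the E1/E2/E3 renormalization loop on a 32-bit interval,
// with deferred ("pending") bits handling the underflow case so the
// interval can never collapse.
struct RenormState {
    uint64_t low = 0;
    uint64_t high = 0xFFFFFFFFull;
    uint64_t pending = 0;        // deferred bits from E3
    std::string bits;            // stand-in for a real bit sink

    void emit(int b) {
        bits.push_back(static_cast<char>('0' + b));
        // Flush deferred bits as the opposite of the resolved bit.
        while (pending) { bits.push_back(static_cast<char>('1' - b)); --pending; }
    }

    void renormalize() {
        const uint64_t kHalf = 0x80000000ull;
        const uint64_t kQuarter = 0x40000000ull;
        for (;;) {
            if (high < kHalf) {                  // E1: lower half -> bit 0
                emit(0);
            } else if (low >= kHalf) {           // E2: upper half -> bit 1
                emit(1);
                low -= kHalf; high -= kHalf;
            } else if (low >= kQuarter && high < 3 * kQuarter) {
                ++pending;                       // E3: straddles midpoint
                low -= kQuarter; high -= kQuarter;
            } else {
                break;                           // interval wide enough
            }
            low = low * 2;                       // rescale the interval
            high = high * 2 + 1;
        }
    }
};
```

The decoder runs the mirror image of this loop on its code register, which is what keeps the two sides synchronized bit for bit.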


Implementation Architecture

The system is organized into modular components:

  • SymbolModel / CumFreq – static frequency computation and cumulative tables
  • AdaptiveModel – dynamic frequency updates using a Fenwick tree
  • ArithmeticEncoder / ArithmeticDecoder – core coding logic
  • BitIO / BitIOStream – bit-level input/output (memory and streaming)
  • Evaluation modules – entropy, Huffman coding, runtime, and memory analysis

This structure allows easy extension and experimentation with alternative models or I/O strategies.
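As an illustration of the Fenwick-tree idea behind AdaptiveModel (a generic sketch, not the repository's code), point updates and prefix sums both run in O(log n), which keeps per-symbol cumulative-frequency queries cheap even as counts change constantly:

```cpp
#include <cstdint>
#include <vector>

// Minimal Fenwick (binary indexed) tree over symbol counts: add() bumps
// one symbol's frequency, prefix() returns the cumulative frequency of
// all symbols below a given index -- both in O(log n).
class Fenwick {
public:
    explicit Fenwick(int n) : tree_(n + 1, 0) {}

    // Add `delta` to the count of symbol `i` (0-based).
    void add(int i, uint32_t delta) {
        for (int j = i + 1; j < static_cast<int>(tree_.size()); j += j & -j)
            tree_[j] += delta;
    }

    // Sum of counts of symbols 0..i-1 (cumulative frequency below i).
    uint32_t prefix(int i) const {
        uint32_t s = 0;
        for (int j = i; j > 0; j -= j & -j) s += tree_[j];
        return s;
    }

private:
    std::vector<uint32_t> tree_;
};
```

For a 256-symbol byte alphabet this turns the naive O(n) cumulative scan into at most eight tree hops per query.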


Testing and Validation

Correctness is verified through automated round-trip testing:

  • Input data is compressed and then decompressed

  • Output is compared byte-for-byte with the original input

  • Tests cover:

    • Empty files
    • Single-byte inputs
    • Low-entropy data
    • High-entropy random data (with fixed seeds)
    • Large files in streaming mode

All tests are executed using CTest and fail immediately on any mismatch.


Evaluation

The project evaluates compression performance against:

  • Shannon entropy (theoretical lower bound)
  • Huffman coding (prefix-based baseline)

Metrics include:

  • Average bits per symbol
  • Total compressed size
  • Compression ratio
  • Encoding and decoding runtime
  • Memory usage and peak RSS

Results show that arithmetic coding consistently achieves compression close to entropy and often outperforms Huffman coding, especially for larger inputs.
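For reference, the entropy baseline is the standard empirical estimate H = -Σ p_i log2(p_i) over the input's byte distribution. A minimal sketch (my own helper for illustration, not the project's evaluation module):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Empirical Shannon entropy in bits per byte: the theoretical lower
// bound on average code length that the evaluation compares against.
double entropyBitsPerByte(const std::vector<uint8_t>& data) {
    if (data.empty()) return 0.0;
    double counts[256] = {0};
    for (uint8_t b : data) counts[b] += 1.0;
    const double n = static_cast<double>(data.size());
    double h = 0.0;
    for (double c : counts) {
        if (c > 0) {
            const double p = c / n;
            h -= p * std::log2(p);   // contribution of one symbol
        }
    }
    return h;
}
```

A constant file scores 0 bits per byte, a uniform two-symbol file scores exactly 1, and arithmetic coding's average bits per symbol should land within a small constant of this value.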


Build Requirements

  • C++17-compatible compiler (g++ or clang++)
  • CMake 3.16 or later
  • Unix-like environment (Linux or macOS)

Command-Line Interface

The project builds a single executable, ac, which exposes multiple subcommands via a command-line interface.

In local builds, the executable is typically run as:

./build/ac

Usage

ac encode <input_file> <output_file.ac>
ac decode <input_file.ac> <output_file>
ac encode-adaptive <input_file> <output_file.ac>
ac decode-adaptive <input_file.ac> <output_file>
ac encode-adaptive-stream <input_file> <output_file.ac>
ac decode-stream <input_file.ac> <output_file>
ac eval <input_file> <output_report.txt>
ac eval-adaptive <input_file> <output_report.txt>
ac eval-stream <input_file> <output_report.txt>
ac --help

Compression Commands

  • encode / decode: Performs static arithmetic coding using a precomputed frequency table stored in the compressed file.

  • encode-adaptive / decode-adaptive: Performs adaptive arithmetic coding, updating symbol frequencies dynamically during processing.

  • encode-adaptive-stream / decode-stream: Enables streaming adaptive compression and decompression, processing data incrementally to minimize memory usage.

Evaluation Commands

  • eval: Computes entropy, Huffman coding statistics, arithmetic coding efficiency, runtime, and compression ratio for static mode.

  • eval-adaptive: Evaluates compression efficiency and runtime for adaptive arithmetic coding.

  • eval-stream: Evaluates streaming adaptive compression, including memory usage characteristics.


Build and Run

# Clone repository
git clone https://github.com/MonicaMell/ArithmeticCompressor.git
cd ArithmeticCompressor

# Build
mkdir build && cd build
cmake ..
make

After building, run the executable using:

./build/ac <command> [arguments]

Project Goals

  • Demonstrate a correct and near-optimal implementation of arithmetic coding
  • Translate information theory into a practical software system
  • Address real-world concerns such as numerical precision, memory usage, and scalability
  • Provide a clean, extensible codebase suitable for academic defense and future work

Future Improvements

Possible extensions include:

  • Larger or non-byte alphabets
  • Context-based or higher-order probability models
  • Integration with other compression pipelines
  • Comparison with modern alternatives such as ANS

Author

Monika Melkonyan
Fourth-year Computer Science student
French University in Armenia


License

This project is intended for academic and educational use.
