Aethel-Systems/LAZ

LAZ (Literal-cast Absolute Zero)

This is not the LAZ/LASzip LiDAR format. This is a standalone general-purpose lossless compressor.

Industrial-Grade "Literal Casting" Lossless Compression Engine

LAZ rethinks data storage from first principles. General-purpose compression schemes rely on probabilistic statistical models (Huffman/ANS) or sliding-window dictionaries (LZ77). LAZ instead uses literal casting and physical bit-width stripping, and is designed to outperform traditional algorithms such as Zstandard and LZ4 in specific industrial data domains: sparse data, structured logs, and binary vectors.


🛠 Core Technical Features

2.1 Literal Casting

The core of the algorithm is converting fixed-width (typically 8-bit) data into variable-length sequences that keep only the significant bits of each value.

Standard Parsing Example:

  • Original File Read: Read the 8-bit value 00110010 from the buffer.
  • Visual and Normative Extraction: In ASCII, this byte is the character '2', with hexadecimal value 0x32.
  • Traditional Redundancy Analysis: The high nibble 0x3 of 0x32 serves solely as a type identifier in ASCII (it marks the byte as a numeric character).
  • Literal Casting Operation: Ignore the high-order type identifier and extract the literal numerical value, 0x2.
  • Binary Dimensionality Reduction Storage: Write 0x2 as its minimal binary sequence without leading zeros: 10 (2 bits instead of 8).
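
The steps above can be sketched in C. This is a minimal illustration under stated assumptions, not LAZ's actual implementation: the helper names `literal_cast_digit` and `min_bit_length` are hypothetical, and the sketch handles ASCII digits only.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to write v without leading zeros.
 * The value 0 still needs one bit ("0"). */
static unsigned min_bit_length(uint8_t v)
{
    unsigned n = 1;
    while (v >>= 1) {
        n++;
    }
    return n;
}

/* Hypothetical helper: drop the ASCII type nibble (0x3) of a digit
 * character and return its literal value 0..9. */
static uint8_t literal_cast_digit(uint8_t byte)
{
    assert(byte >= '0' && byte <= '9');
    return (uint8_t)(byte & 0x0F);
}
```

For the byte 0x32 ('2'), `literal_cast_digit` yields 0x2 and `min_bit_length` reports 2 bits, matching the sequence 10 in the example.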

2.2 Dynamic Floor Bandwidth ($B_f$)

The LAZ algorithm introduces a floor bandwidth parameter (default $B_f = 2$). This parameter sets the minimum number of bits that each value occupies in the binary stream after dimensionality reduction.

  • Rule: If the binary form of a value's absolute value is shorter than $B_f$ bits, it is padded with 0s in the high bits until its length equals $B_f$. If its length is $\ge B_f$, it is kept as is and never truncated.
  • Example: When $B_f = 2$, the decimal value 0 converts to binary 0 (length 1) and is padded to 00 (length 2); the value 6 converts to binary 110 (length 3) and is preserved as is.
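
A minimal sketch of the padding rule, assuming values fit in 32 bits. The function `emit_with_floor` is a hypothetical name, and it writes the bits as a '0'/'1' string purely for readability; a real encoder would pack them into a bitstream.

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed to write v without leading zeros (0 needs one bit). */
static unsigned bit_length(uint32_t v)
{
    unsigned n = 1;
    while (v >>= 1) {
        n++;
    }
    return n;
}

/* Write `value` into `out` as binary digits, left-padded with '0'
 * to at least `floor_bits` digits. Values already at or above the
 * floor are written unchanged. Returns the number of digits. */
static unsigned emit_with_floor(uint32_t value, unsigned floor_bits, char out[33])
{
    unsigned width = bit_length(value);
    if (width < floor_bits) {
        width = floor_bits;
    }
    assert(width <= 32);
    for (unsigned i = 0; i < width; i++) {
        unsigned shift = width - 1 - i;
        out[i] = ((value >> shift) & 1u) ? '1' : '0';
    }
    out[width] = '\0';
    return width;
}
```

With `floor_bits` set to 2, the value 0 is written as 00 and the value 6 as 110, as in the example above.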

2.3 Ambiguity Resolution and Type Predefinition (Type_Profile)

Problem Statement: After literal casting, both ASCII '2' (0x32) and the raw byte 0x02 are reduced to the binary 10. The decoder therefore cannot tell which original value to restore.

Solution: Introduce a Type_Profile field in the file header that declares the data semantics of the current data block.

  • Profile 0x01 (Numeric ASCII): Declares this data block as numeric ASCII text. Upon extracting 10 (i.e., 0x2), the decoder performs the bitwise OR 0x2 | 0x30, restoring it to 0x32 ('2').
  • Profile 0x02 (Hex ASCII): Declares this data block as hexadecimal ASCII text (e.g., log dumps). Based on the numerical range of the value, the decoder restores it to 0x30 to 0x39 ('0'-'9') or 0x41 to 0x46 ('A'-'F').
  • Profile 0x00 (Raw Binary): Declares it as a native binary stream. The decoded 10 (0x2) is zero-extended to 0x02 (00000010) for output.
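
The three restore rules can be sketched as a single dispatch on the profile value. The profile codes come from the list above; the enum names, the function `restore_byte`, and the exact mapping of the values 10 to 15 onto 'A' to 'F' are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    PROFILE_RAW_BINARY    = 0x00,
    PROFILE_NUMERIC_ASCII = 0x01,
    PROFILE_HEX_ASCII     = 0x02
};

/* Hypothetical decoder step: turn one extracted literal back into the
 * original byte. Returns 0 on success, -1 if the literal is out of
 * range for the profile or the profile is unknown. */
static int restore_byte(uint8_t profile, uint8_t literal, uint8_t *out)
{
    assert(out != NULL);
    switch (profile) {
    case PROFILE_RAW_BINARY:
        *out = literal;                       /* zero-extended as is */
        return 0;
    case PROFILE_NUMERIC_ASCII:
        if (literal > 9) {
            return -1;
        }
        *out = (uint8_t)(literal | 0x30);     /* 0x2 | 0x30 == 0x32 */
        return 0;
    case PROFILE_HEX_ASCII:
        if (literal > 15) {
            return -1;
        }
        /* Assumed mapping: 0..9 -> '0'..'9', 10..15 -> 'A'..'F'. */
        *out = (uint8_t)(literal < 10 ? '0' + literal : 'A' + (literal - 10));
        return 0;
    default:
        return -1;
    }
}
```

The same literal 0x2 restores to 0x32 under profile 0x01 and to 0x02 under profile 0x00, which is the ambiguity the header field resolves.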

3. Semantically Predefined Configurations (Type Profiles)

LAZ incorporates multiple Profiles optimized for industrial scenarios, automatically matching the best compression strategy:

  • Numeric/Hex ASCII: For sensor logs, hexadecimal dumps.
  • Sparse 16/32: For 16-bit and 32-bit machine words with many leading zero bits.
  • Base64 Bridged: For text-encoded binary streams.
  • UTF-8 Mixed: Strips fixed prefix redundancy from multi-byte characters.
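
To illustrate the "fixed prefix redundancy" that the UTF-8 Mixed profile targets, the sketch below separates the structural prefix bits of a UTF-8 byte from its payload bits. It shows only where the redundancy lies; how LAZ actually encodes UTF-8 is not described here, and the name `utf8_strip_prefix` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the payload bits of one UTF-8 byte in *payload and return how
 * many bits they occupy. Returns 0 for bytes that are not valid UTF-8
 * lead or continuation bytes. */
static unsigned utf8_strip_prefix(uint8_t byte, uint8_t *payload)
{
    assert(payload != NULL);
    if ((byte & 0x80u) == 0x00u) {            /* 0xxxxxxx: ASCII        */
        *payload = (uint8_t)(byte & 0x7Fu);
        return 7;
    }
    if ((byte & 0xC0u) == 0x80u) {            /* 10xxxxxx: continuation */
        *payload = (uint8_t)(byte & 0x3Fu);
        return 6;
    }
    if ((byte & 0xE0u) == 0xC0u) {            /* 110xxxxx: 2-byte lead  */
        *payload = (uint8_t)(byte & 0x1Fu);
        return 5;
    }
    if ((byte & 0xF0u) == 0xE0u) {            /* 1110xxxx: 3-byte lead  */
        *payload = (uint8_t)(byte & 0x0Fu);
        return 4;
    }
    if ((byte & 0xF8u) == 0xF0u) {            /* 11110xxx: 4-byte lead  */
        *payload = (uint8_t)(byte & 0x07u);
        return 3;
    }
    return 0;
}
```

For the three-byte sequence E2 82 AC (the euro sign), the payloads are 4, 6, and 6 bits wide, so 16 of the 24 stored bits carry information and 8 are fixed prefix.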

4. Industrial-Grade Reliability (Production-Ready)

  • Dual CRC32 Checksum: Verifies the Payload (for transmission safety) and the Source (for lossless restoration) independently, ensuring end-to-end data integrity.
  • 64KB Adaptive Chunking: Strictly bounds memory usage and supports streaming processing and random access.
  • Adaptive Escape Mechanism: Data blocks that yield no compression gain automatically fall back to Raw mode, keeping size inflation below 0.1%.
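
The checksum and escape ideas can be sketched as follows. Two assumptions are made: that "CRC32" means the common CRC-32 with reflected polynomial 0xEDB88320 (the variant used by zlib and PNG), and that a chunk falls back to Raw mode whenever its packed form is not strictly smaller. The names `crc32_ieee`, `chunk_mode`, and `choose_chunk_mode` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 65536u   /* 64KB chunk size stated above */

/* Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant). */
static uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u);  /* all ones if low bit set */
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

typedef enum { CHUNK_MODE_PACKED, CHUNK_MODE_RAW } chunk_mode;

/* Assumed escape rule: keep the packed form only when it is strictly
 * smaller than the raw chunk; otherwise store the chunk unchanged. */
static chunk_mode choose_chunk_mode(size_t raw_len, size_t packed_len)
{
    assert(raw_len <= CHUNK_SIZE);
    return packed_len < raw_len ? CHUNK_MODE_PACKED : CHUNK_MODE_RAW;
}
```

In a dual-checksum scheme of this kind, one CRC would be computed over the packed payload and another over the original source bytes, so a decoder can detect both transport corruption and a faulty restoration.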

💻 Quick Start

Compilation

LAZ adheres to a zero-dependency principle, requiring only a C17 standard-compliant compiler and CMake.

cmake -S . -B build -DLAZ_BUILD_TESTS=ON -DLAZ_BUILD_CLI=ON
cmake --build build
ctest --test-dir build --output-on-failure

Command-Line Usage (lazctl)

# Compress
./lazctl compress input.log output.laz

# Decompress
./lazctl decompress output.laz restored.log

C API Integration

#include "laz/laz.h"

📐 Physical File Format

LAZ strictly follows byte alignment and little-endian conventions. The format layout is as follows:

  1. Global Header (64B): Contains the magic number LAZ, version, dual CRC32, and payload statistics.
  2. Chunk Descriptor Table: Records the Profile, bit-width, and offset for each 64KB chunk.
  3. Bitstream Zone: Tightly packed pointer stream and data stream.
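
Because the format is little-endian, multi-byte header fields should be read and written byte by byte so the result does not depend on the host's endianness. The helpers below are a generic sketch of that convention; the names are hypothetical, and the actual field layout is defined in docs/FORMAT.md, not here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a 32-bit value least-significant byte first. */
static void store_u32_le(uint8_t *dst, uint32_t v)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(v & 0xFFu);
    dst[1] = (uint8_t)((v >> 8) & 0xFFu);
    dst[2] = (uint8_t)((v >> 16) & 0xFFu);
    dst[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t load_u32_le(const uint8_t *src)
{
    assert(src != NULL);
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```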

For detailed format specifications, please refer to docs/FORMAT.md.


⚖️ Licensing

This project is licensed under the Apache 2.0 license.
