Skip to content

This repository contains a number of different lossless text compression algorithms implemented in Python.

Notifications You must be signed in to change notification settings

Talk2Levi/Lossless-Data-Compression-Zoo

Repository files navigation

Lossless-Data-Compression-Zoo HitCount

This repository contains a number of different lossless text compression algorithms implemented in Python.

Dataset

The dataset was trasformed using Burrows-Wheeler text Transform (BWT) before passing to the data compression algorithm which offers a better input locality and time-and-space performances. (link for the BWT library)

Algorithm

  • Move to Front(MTF): Moves the requested item to the front.

    • The Julia version of this algorithm is here.
  • Timestamp(TS): Inserts an accessed item x in front of the first item y (from the front of the list) that precedes x in the list and was accessed at most once since the last access to x. If there is no such item y or x is accessed for the first time, Timestamp does not move the item.

  • Move by Bit(MBB): Each item has a bit associated with it. At the beginning, all bits are 0. After an access to an item x, MBB moves it to the front if the bit of x is 1; otherwise, it keeps x at its position. In addition, after each access, the bit of the accessed item is flipped.

  • Move to Front Random(MTFR): A variant of MTF. Instead of moving the item to the first position (index 0) in the list, it moves it to the index based on a randomly generated number which is in the range of [0, current index]. The seed that is used by the random generator is encoded in the file in order to decompress. In this project, we use 1 as the random seed.

  • Move to Front Reverse: Upon accessing an item at index i, reverse the ordering of the first i items in the list (the accessed item will be moved to the front)

  • Move to Front Reverse 2: The improved version of move to front reverse. This improved algorithm reversing all the items in front of the one being accessed, it only reverses a small chunk of items. e.g. if the chunk size is defined as 10 and the index being accessed is 90, then the algorithm reverses the item in range[80, 90] (The chunk size was default set to be 10 in the algorithm).

List Update Problem

A classical algorithm in the context of Self-adjusting data structures. Given a set of items in a list where the cost of accessing an item is proportional to its distance from the head of the list.

Data Compression

One important application of list update is in data compression. Given a data sequence, we want to do lossless compression, i.e. we should be able to recover the exact text from the compressed one.

Algorithm Performance

The best performed data compression algorithm for the most of testing input is the simplest one which MTF in red, i.e. the algorithm with the lowest competitive ratio, and a few runner-up algorithms are in brown.

Contributing contributions welcome All Contributors

The project was for the final case study of a Graduated Level Course: Online Algorithm, and the main contributors were (names in alphabetical order):


Levi Guo

Minghao Li

Ye Yuan

About

This repository contains a number of different lossless text compression algorithms implemented in Python.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages