Program that compresses and decompresses text files with the Huffmann’s method.
- General Info
- Data Structures
- Space requirements and time target analysis
- General assumptions
- External specification - setup
- External specification - usage
- Internal specification
- Internal code description
- Testing
- Deficiencies according to my assessment
- Sources for acquiring knowledge
Compression of the data has been something that revolutionized digital media. Compression algorithms are used to reduce the size of a file while ensuring that files can be fully restored to their original state. There is a very wide range of algorithms that can be used to perform lossless data compression. The simplest known data compression algorithm is the Run-Length Encoding (RLE). Unfortunately, it has some drawbacks (quite significant ones) - there are no repeated characters, and the size of output data can be two times more than the size of input data. To eliminate this problem Huffman Coding was used. Huffman coding algorithm was introduced by David Huffman in 1952. The main purpose of information theory is to convey information in as few bits as possible. Huffman's coding has the same aim and is largely successful. The algorithm stands on creating a tree in the form of a leaf node and its children which has a probability of the frequent appearance of the characters. So, Huffman algorithm has more efficiency in file compression with greater compression ratio. In the era of the internet, the transmission of data depends on time. More data takes more time to transmit. Huffman coding reduces the data size thus data transmission time is significantly reduced. It is the basis of many data compression and encoding schemes. mainly because it is a lossless data conversion. This Algorithm is used for compressing / decompressing data - reducing its size without losing any information. It consists on two major steps:
- Generating a Huffman tree from the input characters
- Traversing the Huffman tree to assign codes to the characters (encoding) For decompression, the Huffman tree must be reconstructed.
So the final objectives are:
- Compressing data without losing data
- Statistical coding - more frequently used symbols should have shorter code words
- Information transmission in fewest possible bits
In my project implementation I have used the priority queue and the binary tree as data structures. Binary tree is a tree data structure in which each node has at most two children, which are referred to as the left child and the right child. A priority queue is an abstract data-type in which each element additionally has a "priority" associated with it. In a priority queue, an element with high priority is served before an element with low priority.
Space requirement:
O (n) The algorithm creates a tree with a maximum of 2n - 1 elements.
Time requirement:
O (n log n) The time requirement for adding to the stack is O (log n) and is done n times in the worst case.
The program gets the data to be compressed or decompressed into the stdin and outputs the corresponding data to the stdout.
The compiler used during creating the project: clang 14.0.6
language: c++
language standard: 13
My script implementation is a command line program. For the purpose of preparing my solution, I have learned how to use ‘make’ so it is incipiently included. Therefore, to generate the “huff” executable file you must use the command:
$ make
Program gleans two different options:
-
--encode
-
--decode
Furthermore it requires names of input and output files.
The input file declaration stands without a parameter and output file is preceded by a “-o” flag, eg:
$ ./huff –encode input_to_encode.txt -o encoded_output.txt
$ ./huff –decode encoded_input.txt -o decoded_output.txt
Program called without any parameters prints a short manual:
Program called with incorrect input file name prints an error message:
The program is implemented with a structural paradigm. User interface is separated from the program's logic.
The program accepts the input file in the manner defined above. Calculates the frequency of occurrences. On this basis creates a binary tree - ”huffman tree” - and huffman code from these tree(traverse the tree), encodes the tree data, after rewriting the bits to bytes, writes the result to the output file. When decoding, the previously encoded file is opened - the bytes are converted into bits, the binary tree is regenerated on the basis of which the date is decoded. Restoring the compressed data to its original form is based on a data storage algorithm.
The program has been tested with various types of files. An empty input file does not cause failure - an empty output file is created. 1 of the sample test provided is included into solution directory in img folder. However, the script did not encounter any problems with slightly larger files.
- obtaining problems with large (or at least very large) files
- could be more responsive to error situations
- can reserve a significant amount of memory