In honor of my favorite Silicon Valley show I decided to try my hand at a searchable compression algorithm
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.

Searchable Compression

This is just a little idea I had. Strictly a proof of concept. Naturally, inspired by Silicon Valley

The Concept

The Problem:

Most modern compression relies on a Dictionary Coder which relies on a continuously- updated "dictionary", its coding result is dependent on the history (all codes in the dictionary that is derived from the input data up to the current symbol), so it is not possible to jump into a certain location and start decoding, without first decoding all of the previous data.

My Solution:

This method only works on plain text files. It centers around the idea that spaces are the important feature that will be present in almost every file of text. Spaces separate words, so if we can tell where the spaces/word boundaries are while in zipped format, we can begin to decode the words around them.

The Implementation

I use a standard Huffman compression algorithm. My implementation does not add any unnecessary complexity to the algorithm, but it was done for a homework assignment before I understood any OO best practices. I will clean it up eventually, but until then bear with me.

A Huffman coding tree is created without regard for spaces, they are ignored at first. Then, with the tree generated from all non-space characters, we start the fun.

Since the implementation considers spaces to be the important feature, I wanted to create a sequence so that if seen even in binary, it could only be a space. I chose to use the sequence of all 0's. The game became: Find the maximum number of 0's that can be generated by any two characters next to each other. In order for a space to be unique, it will have to be one greater than that number. I then made the assumption that the maximum 0 sequence will be come character followed by the character encoded by all 0's, the bottom left value on the Huffman tree.

In there is a concise algorithm to find the minimum unique 0 sequence for a space character. It involves concatenating the bit values for a character followed by the bit pattern of the character encoded by all 0's, then shifting bits and finding the maximum divisibility by a power of 2.

Once we have found the maximum coincidental sequence of 0's, the Huffman tree is modified to include the encoded space, appending a 1 to the character that was encoded by all 0's. The space is then places as far down the tree's left side as it needs to be.

With these modifications we can encode search queries and see if their bit patterns, relative to a space/word boundary, appear in the binary of the zip file.