LZ77 compression (Ziv and Lempel, 1977), also called sliding-window compression
* Uses previously seen text as a dictionary
* Replaces phrases in the input text with pointers into the dictionary to achieve compression

+-----------------------------+-------------------+
|           Window            | Look Ahead Buffer |
+-----------------------------+-------------------+

* text window: the most recently encoded text
* look-ahead buffer: the text still to be encoded
* Token (offset, length, symbol):
    * an offset to a phrase in the text window
    * the length of the phrase
    * the first symbol in the look-ahead buffer that follows the phrase.
* A token defines a variable-length phrase at the start of the current look-ahead buffer
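
As a small illustration of the token fields listed above, the sketch below applies a single token to text that has already been decoded. The helper name `apply_token` is hypothetical (it is not part of this notebook), and the sketch assumes the referenced phrase lies entirely inside the text decoded so far.

```python
def apply_token(decoded, token):
    """Extend already-decoded text by one (offset, length, symbol) token.

    Hypothetical helper for illustration only. Assumes length <= offset,
    i.e. the phrase does not run past the end of `decoded`.
    """
    offset, length, symbol = token
    start = len(decoded) - offset            # go `offset` symbols back
    phrase = decoded[start:start + length]   # copy `length` symbols
    return decoded + phrase + symbol         # then append the next symbol


# The '_' one position into "A_walrus" lies 7 symbols back from the end.
print(apply_token("A_walrus", (7, 1, "i")))  # A_walrus_i
```

A token with offset 0 and length 0 copies nothing and only appends its symbol, which is how a symbol with no earlier occurrence is represented.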

In [1]:
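def longest_mutch(window, look_ahead):
    """
    Find the longest prefix of look_ahead that occurs in window.
    Among equally long matches, prefer the one that starts latest,
    i.e. the one with the shortest offset back from the end of window.
    A match must lie entirely inside window (no overlap into look_ahead).
    Input: window - the string to search in
           look_ahead - the string whose prefix is searched for
    Output: (start, length), where start is an index into window,
            or None if even the first symbol of look_ahead is not in window.
    """
    match = None
    maxlen = 0

    for i in range(len(window)):
        # Count how many symbols of look_ahead match window starting at i.
        length = 0
        j = i
        for k in range(len(look_ahead)):
            if j < len(window) and window[j] == look_ahead[k]:
                length += 1
                j += 1
            else:
                break

        # ">=" lets a later start win ties, which gives the shortest offset.
        if length > 0 and length >= maxlen:
            match = (i, length)
            maxlen = length

    return match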

In [2]:
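def lz77_v0_encode(T):
    """
    LZ77 version 0 encoding - no window size limit.
    Returns a list of tokens of the form (offset, length, symbol):
    offset is the distance back from the current position to the match,
    length is the match length, and symbol is the symbol after the match
    ('' if the match reaches the end of the text).
    """
    textlen = len(T)
    tokens = []

    p = 0  # partition: T[:p] is the window, T[p:] is the look-ahead buffer
    while p < textlen:
        window = T[:p]
        look_ahead = T[p:]

        # Search for the longest match of look_ahead in window.
        match = longest_mutch(window, look_ahead)

        if match is None:
            # T[p] does not occur in the window: emit it as a literal.
            tokens.append((0, 0, T[p]))
            p = p + 1
        else:
            start, length = match
            if length < len(look_ahead):
                tokens.append((p - start, length, look_ahead[length]))
            else:
                # The match reaches the end of the text: no symbol follows it.
                tokens.append((p - start, length, ''))
            p = p + length + 1

    return tokens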

In [3]:
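def lz77_v0_decode(tokens):
    """
    Input: list of tokens [(offset, length, symbol)]
    Output: the decoded text
    """
    T = ""

    for offset, length, symbol in tokens:
        # Copy `length` symbols starting `offset` positions back,
        # then append the token's symbol.
        # Assumes length <= offset, which holds for lz77_v0_encode tokens.
        start = len(T) - offset
        T += T[start:start + length]
        T += symbol

    return T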
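
Version 0 above treats the entire already-encoded prefix as the window. A sliding window would instead only search the last few symbols. The following is a minimal sketch of that idea, not the notebook's own method: the names `lz77_windowed_encode`, `lz77_windowed_decode` and the parameter `window_size` are assumptions made for illustration. Unlike version 0, this sketch always reserves the last symbol of the text as a literal, so no token carries an empty symbol.

```python
def lz77_windowed_encode(text, window_size=16):
    """Sketch of LZ77 encoding with a bounded window (hypothetical helper)."""
    tokens = []
    p = 0
    while p < len(text):
        window_start = max(0, p - window_size)  # oldest position still in the window
        max_length = len(text) - p - 1          # keep one symbol for the literal
        best_offset, best_length = 0, 0
        for start in range(window_start, p):
            length = 0
            # Extend the match while it stays inside the window (no overlap).
            while (length < max_length
                   and start + length < p
                   and text[start + length] == text[p + length]):
                length += 1
            # ">=" prefers the latest start, i.e. the shortest offset.
            if length > 0 and length >= best_length:
                best_offset, best_length = p - start, length
        tokens.append((best_offset, best_length, text[p + best_length]))
        p += best_length + 1
    return tokens


def lz77_windowed_decode(tokens):
    """Inverse of the sketch above; tokens never overlap the text being built."""
    text = ""
    for offset, length, symbol in tokens:
        start = len(text) - offset
        text += text[start:start + length] + symbol
    return text


sample = "A_walrus_in_Spain_is_a_walrus_in_vain."
sample_tokens = lz77_windowed_encode(sample, window_size=8)
assert lz77_windowed_decode(sample_tokens) == sample
assert all(offset <= 8 for offset, _, _ in sample_tokens)
```

A smaller window means offsets need fewer bits, but long-range repetitions such as the second "_walrus_in_" can no longer be referenced in one token.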

In [4]:
# LZ77 example PDF7 slide 9:
T = "A_walrus_in_Spain_is_a_walrus_in_vain."
lz77_tokens = lz77_v0_encode(T)
print("Text: {}".format(T))
print("Encode: {}".format(lz77_tokens))
print("Decode: {}".format(lz77_v0_decode(lz77_tokens)))

Text: A_walrus_in_Spain_is_a_walrus_in_vain.
Encode: [(0, 0, 'A'), (0, 0, '_'), (0, 0, 'w'), (0, 0, 'a'), (0, 0, 'l'), (0, 0, 'r'), (0, 0, 'u'), (0, 0, 's'), (7, 1, 'i'), (0, 0, 'n'), (3, 1, 'S'), (0, 0, 'p'), (11, 1, 'i'), (6, 2, 'i'), (12, 2, 'a'), (21, 11, 'v'), (20, 3, '.')]
Decode: A_walrus_in_Spain_is_a_walrus_in_vain.


In [5]:
# LZ77 example from class:
T = "a" * 100
lz77_tokens = lz77_v0_encode(T)
print("Text: {}".format(T))
print("Encode: {}".format(lz77_tokens))
print("Decode: {}".format(lz77_v0_decode(lz77_tokens)))

Text: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Encode: [(0, 0, 'a'), (1, 1, 'a'), (3, 3, 'a'), (7, 7, 'a'), (15, 15, 'a'), (31, 31, 'a'), (37, 37, '')]
Decode: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
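
The match lengths 1, 3, 7, 15, 31 in the output above follow from the rule that a match must lie entirely inside the window: at position p the window holds p symbols, so the match has length at most p, and the next position is 2p + 1. The sketch below replays just that arithmetic for a text made of one repeated symbol. The function name `run_match_lengths` is hypothetical and not part of the notebook.

```python
def run_match_lengths(n):
    """Match lengths of the tokens for a run of n identical symbols,
    assuming matches may not extend past the end of the window."""
    lengths = []
    p = 0
    while p < n:
        # The match is limited by the window size (p) and by the text left (n - p).
        length = min(p, n - p)
        lengths.append(length)
        p += length + 1  # skip the matched phrase plus the token's symbol
    return lengths


print(run_match_lengths(100))  # [0, 1, 3, 7, 15, 31, 37]
```

Because the position roughly doubles with every token, a run of n identical symbols needs about log2(n) tokens in this version.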
