# LZ78 Usage Tutorial: Compression

**Note**: Please work through `Sequences.ipynb` first if you haven't already.

## Prerequisites
1. Follow the setup instructions in `tutorials/README.md`
2. In the same Python environment that you set up in step 1, run `pip install ipykernel`
3. Use that Python environment as the kernel for this notebook.

## Important Note
Sometimes, Jupyter doesn't register that a cell calling into the `lz78` library has started running, so the cell appears to be waiting to run until it has actually finished.
This can be annoying for long-running operations, and **can be remedied by putting `stdout.flush()` at the beginning of the cell**.

## Imports

In [None]:
from lz78 import Sequence, LZ78Encoder, CharacterMap, BlockLZ78Encoder
from lz78 import encoded_sequence_from_bytes
import os
import lorem
from sys import stdout

## LZ78 Compression
The `LZ78Encoder` object performs plain LZ78 encoding and decoding, as described in "Compression of individual sequences via variable-rate coding" (Ziv, Lempel 1978).
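For intuition about what plain LZ78 does, here is a minimal pure-Python sketch of the phrase parsing step. It is not the `lz78` library's implementation: the function names `lz78_parse` and `lz78_unparse` are hypothetical, and the output is a list of `(prefix index, symbol)` pairs rather than a packed bitstream.

```python
def lz78_parse(text):
    """Split `text` into LZ78 phrases, returned as (prefix_index, symbol) pairs.

    Index 0 stands for the empty phrase. Illustrative only; this is not
    how the lz78 library represents its output.
    """
    phrase_index = {"": 0}  # maps each known phrase to its index
    pairs = []
    current = ""
    for symbol in text:
        candidate = current + symbol
        if candidate in phrase_index:
            # Keep extending the current phrase while it is already known
            current = candidate
        else:
            # New phrase: emit (index of known prefix, new symbol)
            pairs.append((phrase_index[current], symbol))
            phrase_index[candidate] = len(phrase_index)
            current = ""
    if current:
        # Input ended in the middle of a known phrase
        pairs.append((phrase_index[current], ""))
    return pairs


def lz78_unparse(pairs):
    """Invert lz78_parse by rebuilding the phrase list while decoding."""
    phrases = [""]
    for prefix, symbol in pairs:
        phrases.append(phrases[prefix] + symbol)
    return "".join(phrases[1:])


pairs = lz78_parse("abababab")
print(pairs)
print(lz78_unparse(pairs))
```

Each phrase is the shortest prefix of the remaining input that has not been seen before, which is why repetitive inputs produce long phrases and few pairs.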

### 1. `CompressedSequence` object
A `CompressedSequence` object stores an encoded bitstream, as well as some auxiliary information needed for decoding.
`CompressedSequence` objects cannot be instantiated directly;
they are returned by `LZ78Encoder.encode`.

The main functionality is:
1. Getting the compression ratio as `(encoded size) / (uncompressed len * log A)`,
    where A is the size of the alphabet.
2. Getting a byte array representing this object, so that the compressed
    sequence can be stored to a file
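The compression ratio formula in item 1 can be sketched as a small standalone function. This assumes the encoded size is measured in bits and the logarithm is base 2 (so the denominator is the size of a naive fixed-width encoding); the helper name `compression_ratio` below is hypothetical and separate from the library's method.

```python
import math


def compression_ratio(encoded_bits, uncompressed_len, alphabet_size):
    """(encoded size) / (uncompressed len * log A), assuming bits and log base 2."""
    raw_bits = uncompressed_len * math.log2(alphabet_size)
    return encoded_bits / raw_bits


# 1000 symbols over a 4-symbol alphabet need 2000 bits uncompressed,
# so a 500-bit encoding has ratio 0.25.
print(compression_ratio(500, 1000, 4))
```

Under these assumptions, a ratio below 1 means the encoded sequence is smaller than the fixed-width representation.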

### 2. Example: LZ78 Encoding

In [None]:
# Make an input sequence to compress
stdout.flush()
data = " ".join(lorem.paragraph() for _ in range(10_000))
charmap = CharacterMap(data)
charseq = Sequence(data, charmap=charmap)
encoder = LZ78Encoder()

#### `LZ78Encoder` Instance method: `encode`
Performs LZ78 encoding on an individual sequence, and returns a `CompressedSequence` object.

In [None]:
stdout.flush()
encoded = encoder.encode(charseq)

#### `CompressedSequence` Instance method: `compression_ratio`

In [None]:
encoded.compression_ratio()

#### Saving a `CompressedSequence` object
`CompressedSequence` has functionality to produce a `bytes` object representation, which can be written directly to a file.
The function `encoded_sequence_from_bytes` produces a `CompressedSequence` object from this `bytes` representation.

In [None]:
stdout.flush()
# Avoid naming the variable `bytes`, which would shadow the Python builtin
encoded_bytes = encoded.to_bytes()

os.makedirs("test_data", exist_ok=True)
with open("test_data/saved_encoded_sequence.bin", 'wb') as file:
    file.write(encoded_bytes)

Now, let's read the compressed sequence from the file and decode it.

In [None]:
with open("test_data/saved_encoded_sequence.bin", 'rb') as file:
    encoded_bytes = file.read()
encoded = encoded_sequence_from_bytes(encoded_bytes)

In [None]:
stdout.flush()
decoded = encoder.decode(encoded)

In [None]:
decoded.get_data()[:100]

In [None]:
assert decoded.get_data() == data

### 3. Block-Wise Compression
Sometimes, it might be useful to loop through blocks of data and perform LZ78 encoding on each block (e.g., if you need to do data processing before LZ78 compression and want to have some sort of pipeline parallelism).

The `BlockLZ78Encoder` has this functionality: you can pass in the input sequence to be compressed in chunks, and the output (`encoder.get_encoded_sequence()`) is the same as if the full concatenated sequence were passed to a single LZ78 encoder.
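To see why chunked input can give the same result as one-shot encoding, here is a pure-Python sketch of an incremental LZ78 parser that carries its phrase dictionary and partial phrase across blocks. The class `IncrementalLZ78Parser` is hypothetical and only illustrates the idea; it is not how `BlockLZ78Encoder` is implemented.

```python
class IncrementalLZ78Parser:
    """Toy LZ78 parser whose state persists between calls to `feed`."""

    def __init__(self):
        self.phrase_index = {"": 0}  # phrase -> index, shared by all blocks
        self.current = ""            # partial phrase carried across blocks
        self.pairs = []              # emitted (prefix_index, symbol) pairs

    def feed(self, block):
        for symbol in block:
            candidate = self.current + symbol
            if candidate in self.phrase_index:
                self.current = candidate
            else:
                self.pairs.append((self.phrase_index[self.current], symbol))
                self.phrase_index[candidate] = len(self.phrase_index)
                self.current = ""

    def result(self):
        # Include any unfinished phrase without mutating the parser state
        pairs = list(self.pairs)
        if self.current:
            pairs.append((self.phrase_index[self.current], ""))
        return pairs


blocks = ["abab", "ab", "ab"]

blockwise = IncrementalLZ78Parser()
for block in blocks:
    blockwise.feed(block)

one_shot = IncrementalLZ78Parser()
one_shot.feed("".join(blocks))

print(blockwise.result() == one_shot.result())
```

Because a phrase that straddles a block boundary is kept in `current` rather than flushed, the block boundaries leave no trace in the output.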

In [None]:
charmap = CharacterMap("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ. ,?")

In [None]:
encoder = BlockLZ78Encoder(charmap.alphabet_size())

#### Instance method: `encode_block`
Encodes a block using LZ78, starting at the end of the previous block.

All blocks must be over the same alphabet; otherwise, the call to `encode_block` raises an error.

In [None]:
stdout.flush()
for _ in range(1000):
    encoder.encode_block(Sequence(lorem.paragraph(), charmap=charmap))

In [None]:
# Oops, this won't work! The alphabet size (11) doesn't match the encoder's alphabet.
encoder.encode_block(Sequence([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], alphabet_size=11))

#### Instance method: `get_encoded_sequence`
Returns the compressed sequence, which is equivalent to the output of `LZ78Encoder.encode` on the concatenation of all inputs to `encode_block` thus far.

In [None]:
encoded_sequence = encoder.get_encoded_sequence()
encoded_sequence.compression_ratio()

#### Instance method: `decode`
Decompresses the compressed sequence that has been constructed thus far.

In [None]:
stdout.flush()
decoded = encoder.decode()
print(decoded[376:400])
charmap.decode(decoded[376:400])