# LZ78 Usage Tutorial: Sequences
**Note**: This tutorial is a prerequisite for `EncoderDecoderTutorial.ipynb` and `SPATutorial.ipynb`.

## Prerequisites
1. Follow the setup instructions in `tutorials/README.md`.
2. In the same Python environment you used for that setup, run `pip install ipykernel`.
3. Use that Python environment as the kernel for this notebook.

## Important Note
Sometimes Jupyter does not register that a cell containing code from the `lz78` library has started running, so the cell appears to be waiting to run until it finishes.
This can be annoying for long-running operations, and **can be remedied by putting `stdout.flush()` at the beginning of the cell**.

## Imports

In [None]:
from lz78 import Sequence, CharacterMap
import lorem
import numpy as np
from sys import stdout

## Sequences

Any sequence of data that can be LZ78-encoded (i.e., a list of integers or a string) is represented as a `Sequence` object.
Storing sequences as this object (as opposed to raw lists or strings) allows for a common interface that streamlines the LZ78 encoding process.

Each sequence is associated with an alphabet size, A.

If the sequence consists of integers, they must be in the range $\{0, 1, \ldots, A-1\}$.
If $A < 256$, the sequence is stored internally as bytes.
Otherwise, it is stored as `uint32`.
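
As a rough illustration of what this storage rule means for memory use, here is a minimal sketch. The helper names `bytes_per_symbol` and `approx_storage_bytes` are hypothetical and not part of the `lz78` library; the threshold is taken directly from the text above, and any per-object overhead is ignored.

```python
# Illustrative sketch only; these helpers are hypothetical, not lz78 functions.

def bytes_per_symbol(alphabet_size):
    # Rule as stated above: one byte per symbol if A < 256, else a 4-byte uint32.
    return 1 if alphabet_size < 256 else 4


def approx_storage_bytes(length, alphabet_size):
    # Approximate size of the raw symbol buffer, ignoring any overhead.
    return length * bytes_per_symbol(alphabet_size)


print(approx_storage_bytes(1_000_000, 2))       # 1000000
print(approx_storage_bytes(1_000_000, 50_000))  # 4000000
```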

If the sequence is a string, a `CharacterMap` object maps each character to a number between 0 and A-1.
More on this later.

**Inputs**:
- `data`: either a list of integers or a string.
- `alphabet_size` (optional): the size of the alphabet.
    If this is `None`, then the alphabet size is inferred from the data.
- `charmap` (optional): a `CharacterMap` object; only valid if `data` is a string.
    If `data` is a string and this is `None`, then the character map is inferred from the data.

The methods available for a `Sequence` object are described below.

### 1. Example: Integer Sequence

In [None]:
data = np.random.randint(0, 2, size=(1_000_000,))
int_sequence = Sequence(data, alphabet_size=2)

You must specify the alphabet size when instantiating an integer sequence.
This is because the LZ78 compressor relies on the alphabet size encoded in the `Sequence` object to compress.
The alphabet size associated with a sequence is also used to ensure that a SPA is only trained on sequences from the same alphabet.

In [None]:
# This will fail
int_sequence = Sequence([1, 2, 3, 4])

A limited number of Python list operations work on `Sequence`:

In [None]:
print(len(int_sequence))
print(int_sequence[-20:])

Note that indexing a string-based sequence in this manner returns the integer representation of the string, not the string itself. To recover the string, use the corresponding character map to map these integers back to characters.

#### Instance method: `extend`

Adds data to the end of the sequence.
Data must be over the same alphabet as the current sequence.

In [None]:
more_data = np.random.randint(0, 2, size=(200,))
int_sequence.extend(more_data)

In [None]:
len(int_sequence)
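
To make the list-like interface and the same-alphabet requirement of `extend` concrete, here is a toy pure-Python sketch. `ToySequence` is a hypothetical stand-in written for illustration; it is not how the `lz78` `Sequence` class is implemented, and the error type it raises is an assumption.

```python
# Toy illustration only; the real Sequence class lives in the lz78 library.
class ToySequence:
    def __init__(self, data, alphabet_size):
        self._alphabet_size = alphabet_size
        self._data = []
        self.extend(data)

    def extend(self, data):
        data = list(data)
        # Reject symbols outside {0, ..., A-1} so the alphabet stays consistent.
        if any(not 0 <= x < self._alphabet_size for x in data):
            raise ValueError("data is not over this sequence's alphabet")
        self._data.extend(data)

    def __len__(self):
        # Enables len(seq)
        return len(self._data)

    def __getitem__(self, index):
        # Enables seq[i] and slices such as seq[-3:]
        return self._data[index]


seq = ToySequence([0, 1, 1, 0], alphabet_size=2)
seq.extend([1, 1])
print(len(seq))  # 6
print(seq[-3:])  # [0, 1, 1]
```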

#### Instance method: `alphabet_size`

In [None]:
int_sequence.alphabet_size()

#### Instance method: `get_data`
Returns the full sequence as an integer list or string.

In [None]:
extracted_data = int_sequence.get_data()
print(type(extracted_data))
print(extracted_data[-20:])

### 2. `CharacterMap`
A sequence is defined as integers from 0 to A-1, where A is the alphabet size, so we need a way to map strings to such integer-based sequences.

The `CharacterMap` class maps characters in a string to integer values in a contiguous range, so that a string can be used as an individual sequence.
It has the capability to **encode** a string into the corresponding integer representation, and **decode** a list of integers into a string.

**Inputs**:
- data: a string consisting of all of the characters that will appear in the character map. For instance, a common use case is:
    ```
    charmap = CharacterMap("abcdefghijklmnopqrstuvwxyz")
    ```

In [None]:
# generate some dummy data and make a character map
s = " ".join(lorem.paragraph() for _ in range(10))
charmap = CharacterMap(s)
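
The encode/decode idea can be sketched with a toy pure-Python class. `ToyCharacterMap` is hypothetical and only illustrates the concept; in particular, numbering characters in order of first appearance and raising `ValueError` for unknown characters are assumptions, and the `lz78` `CharacterMap` may behave differently.

```python
# Toy stand-in for illustration; not the lz78 CharacterMap implementation.
class ToyCharacterMap:
    def __init__(self, data):
        # Assumption: symbols are numbered in order of first appearance.
        self.char_to_sym = {}
        for ch in data:
            if ch not in self.char_to_sym:
                self.char_to_sym[ch] = len(self.char_to_sym)
        self.sym_to_char = {v: k for k, v in self.char_to_sym.items()}

    def encode(self, text):
        # Refuse to encode characters that have no assigned symbol.
        missing = sorted(set(text) - set(self.char_to_sym))
        if missing:
            raise ValueError(f"characters not in the map: {missing}")
        return [self.char_to_sym[ch] for ch in text]

    def decode(self, symbols):
        return "".join(self.sym_to_char[s] for s in symbols)

    def alphabet_size(self):
        return len(self.char_to_sym)


toy = ToyCharacterMap("abcdefghijklmnopqrstuvwxyz")
codes = toy.encode("cab")
print(codes)              # [2, 0, 1]
print(toy.decode(codes))  # cab
```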

#### Instance method: `encode`
Takes a string and returns the corresponding integer representation.

In [None]:
charmap.encode("lorem ipsum")

`encode` raises an error if any character to be encoded is not in the alphabet.

In [None]:
# this should error, but with a helpful warning message!
charmap.encode("hello world")

#### Instance method: `filter_string`
Takes a string and removes any characters that are not present in the character mapping.
This is useful if you have some text with special characters, and you don't want the special characters to be in the alphabet.

In [None]:
charmap.filter_string("hello world. Lorem ipsum! @#$%^&*()")

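You can also replace each character that is not in the mapping with a character of your choice, instead of removing it. The example first adds the replacement character `~` to the mapping with `add`.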

In [None]:
charmap.add("~")
charmap.filter_string_and_replace("hello world. Lorem ipsum! @#$%^&*()", "~")
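
For intuition, the two filtering behaviours described above can be written in plain Python. This is a sketch under the assumption that filtering is a per-character membership test; `filter_text`, `filter_and_replace`, and `allowed` are hypothetical names, not part of the `lz78` library.

```python
# Plain-Python sketch of the filtering behaviour described above.
# `allowed` plays the role of the character map's alphabet (an assumption
# for illustration; the library's own implementation may differ).

def filter_text(text, allowed):
    """Drop every character that is not in `allowed`."""
    return "".join(ch for ch in text if ch in allowed)


def filter_and_replace(text, allowed, replacement):
    """Swap every character that is not in `allowed` for `replacement`."""
    return "".join(ch if ch in allowed else replacement for ch in text)


allowed = set("abcdefghijklmnopqrstuvwxyz ")
print(filter_text("hello, world!", allowed))              # hello world
print(filter_and_replace("hello, world!", allowed, "~"))  # hello~ world~
```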

#### Instance method: `decode`
Decodes an integer representation of a string into the string itself.

In [None]:
charmap.decode(charmap.encode("lorem ipsum"))

#### Instance method: `alphabet_size`
Returns the number of characters that the character mapping can represent.

In [None]:
charmap.alphabet_size()

### 2.1 Example: Character Sequence
A string-based sequence is sometimes referred to as a character sequence. It has the same interface as an integer sequence, except there is an underlying `CharacterMap` object that maps characters to corresponding integer values within the alphabet.

You can pass in a `CharacterMap` upon instantiation, or else the character map will be inferred from the data.

**Note**: if you pass in a `CharacterMap` and the input string has characters not present in the character map, instantiation raises an error.
To avoid this, you can call `CharacterMap.filter_string` on the input beforehand.

In [None]:
stdout.flush()
charmap = CharacterMap("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ. ?,")
s = " ".join(lorem.paragraph() for _ in range(1000))
charseq = Sequence(s, charmap=charmap)

As with the alphabet size stipulation when instantiating an integer sequence, you must specify a character map.

In [None]:
seq = Sequence("this will fail!")

Indexing a character sequence returns the integer representations of the corresponding characters.

In [None]:
print(charseq[100:130])

#### Instance method: `get_character_map`
Returns the underlying `CharacterMap` object.
This raises an error if the sequence is not a character sequence.

In [None]:
charmap = charseq.get_character_map()
charmap.decode(charseq[100:130])