# Byte-Pair Encoding

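Byte-Pair Encoding (BPE) is a sub-word tokeniser: it builds its vocabulary by repeatedly merging the most frequent pair of adjacent tokens into a new token.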

- Normalise the corpus and pre-tokenise it
- Split the data into word-level tokens
- Split each word into character-level tokens
- Repeatedly:
    - Count the number of occurrences of each adjacent token pair in the corpus
    - Replace the most frequent token pair with a single new token representing that pair
- Stop when the count of the most frequent pair drops below a chosen threshold, or when the vocabulary is large enough
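
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `train_bpe`, the `min_count` threshold, and the use of lowercasing plus whitespace splitting as stand-ins for normalisation and pre-tokenisation are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

def train_bpe(corpus: str, num_merges: int, min_count: int = 2) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from `corpus` (hypothetical helper)."""
    # Assumed normalisation and pre-tokenisation: lowercase, split on whitespace.
    word_freqs = Counter(corpus.lower().split())
    # Each word becomes a tuple of single-character tokens.
    words: Dict[Tuple[str, ...], int] = {tuple(w): f for w, f in word_freqs.items()}
    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # Count adjacent pairs inside each word, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties go to the pair that was seen first (an assumption of this sketch).
        best = max(pair_counts, key=pair_counts.get)
        if pair_counts[best] < min_count:
            break
        merges.append(best)

        # Replace every occurrence of the best pair with one merged token.
        new_words: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in words.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] = freq
        words = new_words
    return merges

merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

For this small corpus the first two learned merges are `('l', 'o')` followed by `('lo', 'w')`, since both pairs occur in every word and the earlier pair wins the tie.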

Example

"aaabdaaabac"

Tokens: a, b, c, d

First pass:
1. Occurrences: "aa": 4, "ab": 2, "bd": 1, "da": 1, "ba": 1, "ac": 1
2. Replace "aa" with X
3. Tokens: a, b, c, d, X
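
The counts and the replacement in this pass can be checked with a few lines of Python. The helper names `count_pairs` and `merge_pair` are hypothetical. Note that counting uses overlapping windows (so "aaa" contributes two "aa" pairs), while replacement is assumed to scan left to right without overlap, which is why four counted pairs yield only two X tokens.

```python
from collections import Counter
from typing import List, Tuple

def count_pairs(tokens: List[str]) -> Counter:
    # Every adjacent pair is counted, including overlapping ones.
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens: List[str], pair: Tuple[str, str], new_token: str) -> List[str]:
    # Greedy left-to-right replacement; a token is used in at most one merge.
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")
counts = count_pairs(tokens)
print(dict(counts))

after_first_pass = merge_pair(tokens, ("a", "a"), "X")
print("".join(after_first_pass))
```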

"XabdXabac"

Second pass:
1. Occurrences: "Xa": 2, "ab": 2, "bd": 1, "dX": 1, "ba": 1, "ac": 1
2. Replace "Xa" with Y ("Xa" and "ab" tie, so either could be merged; this example picks "Xa")
3. Tokens: a, b, c, d, X, Y

"YbdYbac"

Third pass:
1. Occurrences: "Yb": 2, "bd": 1, "dY": 1, "ba": 1, "ac": 1
2. Replace "Yb" with Z
3. Tokens: a, b, c, d, X, Y, Z
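
All three passes can be reproduced with a short loop. This is a sketch under stated assumptions: the function name `compress` is hypothetical, the placeholder symbols must not already occur in the text, ties are broken in favour of the pair seen first (which matches the choice of "Xa" over "ab" in the second pass), and merging stops once no pair occurs at least twice.

```python
from collections import Counter
from typing import Dict, List, Tuple

def compress(text: str, new_symbols: str) -> Tuple[str, Dict[str, str]]:
    """Apply one merge per symbol in `new_symbols`; return the result and the merge table."""
    tokens: List[str] = list(text)
    merges: Dict[str, str] = {}
    for symbol in new_symbols:
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        # max() returns the first pair that reaches the top count,
        # so ties go to the pair encountered earliest in the string.
        best = max(counts, key=counts.get)
        if counts[best] < 2:
            break  # nothing left worth merging
        merges[symbol] = best[0] + best[1]
        # Greedy, non-overlapping, left-to-right replacement.
        out: List[str] = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                out.append(symbol)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return "".join(tokens), merges

compressed, merges = compress("aaabdaaabac", "XYZ")
print(compressed)
print(merges)
```

The returned merge table has the same shape as the dictionary passed to a decoder that expands merged tokens back into characters.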

"ZdZac"

In [6]:
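from typing import Dict

def untokeniser(map_: Dict[str, str], s: str) -> str:
    """Expand the merged tokens in `s` back into the original characters.

    `map_` maps each merged token to the two-token string it replaced.
    """
    if s == "":
        return s
    if s[0] in map_:
        # Expand the leading token and re-scan: the expansion may itself contain merged tokens.
        return untokeniser(map_, map_[s[0]] + s[1:])
    # A base character: keep it and decode the rest of the string.
    return s[0] + untokeniser(map_, s[1:])

untokeniser({'X': "aa", 'Y': "Xa", 'Z': "Yb"}, "ZdZac")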

'aaabdaaabac'
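
The recursive decoder makes one call per character, so a long input can exceed Python's recursion limit. As an alternative, the same expansion can be written with an explicit stack. This is a sketch only; the name `untokenise_iterative` is hypothetical, and it assumes the merge table contains no cycles.

```python
from typing import Dict, List

def untokenise_iterative(merges: Dict[str, str], s: str) -> str:
    # The stack holds tokens still to be processed, next token on top.
    stack: List[str] = list(reversed(s))
    out: List[str] = []
    while stack:
        token = stack.pop()
        if token in merges:
            # Push the expansion back so its left token is processed first.
            stack.extend(reversed(merges[token]))
        else:
            out.append(token)
    return "".join(out)

decoded = untokenise_iterative({"X": "aa", "Y": "Xa", "Z": "Yb"}, "ZdZac")
print(decoded)
```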

# Packages for Byte-Pair Encoding

The Hugging Face `transformers` library provides pretrained BPE tokenisers. The cell below loads the GPT-2 tokeniser, which uses a byte-level variant of BPE.



In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "aaabdaaabac"
encoded = tokenizer.encode(text, add_special_tokens=True)
print(f"Encoded input: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded input: {decoded}")

Encoded input: [7252, 397, 6814, 64, 397, 330]
Decoded input: aaabdaaabac

The pretrained tokeniser splits the string into six tokens rather than reproducing the merges from the worked example, because its merge rules were learned from GPT-2's training corpus and not from this string.
