# Tokenization
This notebook builds an understanding of tokenization by implementing a simple tokenizer step by step. The input text is first decomposed into a vocabulary of words, and each word is split into single-character tokens. The tokens are then refined by repeatedly merging the adjacent token pair with the highest frequency.

### Imports

In [1]:
import re, collections

### Define Input Text

In [2]:
text = ("a sailor went to sea sea sea "
        "to see what he could see see see "
        "but all that he could see see see "
        "was the bottom of the deep blue sea sea sea")

### Define Vocabulary Function
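Define a function that splits the input text into individual words, rewrites each word as space-separated characters followed by a trailing space, and counts how often each word occurs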

In [3]:
def initialize_vocabulary(text):
  vocab = collections.defaultdict(int)
  words = text.strip().split()
  for word in words:
      vocab[' '.join(list(word)) + ' '] += 1
  return vocab
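
The space-separated key format is what lets later steps recover a word's symbols with a plain `split()`. The short sketch below illustrates this on a single word. The variable names `word`, `key`, and `merged_key` are chosen only for this illustration, and `merged_key` is written by hand to show what a key could look like after one merge.

```python
# Build the key for one word the same way initialize_vocabulary does.
word = "sea"
key = ' '.join(list(word)) + ' '
print(repr(key))           # 's e a '

# Splitting on whitespace recovers one symbol per character.
print(key.split())         # ['s', 'e', 'a']

# Hand-written key as it might look after merging 's' and 'e':
# the same split() now yields a two-character symbol.
merged_key = 'se a '
print(merged_key.split())  # ['se', 'a']
```

Because symbols are delimited only by spaces, a merge amounts to deleting the space between two neighbouring symbols.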

### Execute initialize_vocabulary
Execute initialize_vocabulary to build the vocabulary, a dictionary that maps each word to the number of times it occurs

In [4]:
vocab = initialize_vocabulary(text)

### Print Vocabulary Words and Size
Print all the words in the vocabulary and the size of the vocabulary

In [5]:
print('Vocabulary: {}'.format(vocab))
print('Size of vocabulary: {}'.format(len(vocab)))

Vocabulary: defaultdict(<class 'int'>, {'a ': 1, 's a i l o r ': 1, 'w e n t ': 1, 't o ': 2, 's e a ': 6, 's e e ': 7, 'w h a t ': 1, 'h e ': 2, 'c o u l d ': 2, 'b u t ': 1, 'a l l ': 1, 't h a t ': 1, 'w a s ': 1, 't h e ': 2, 'b o t t o m ': 1, 'o f ': 1, 'd e e p ': 1, 'b l u e ': 1})
Size of vocabulary: 18
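
As a quick cross-check of these numbers, the words can be counted directly with `collections.Counter`. This is a sketch that runs on its own, so it repeats the input text. `word_counts` is a name introduced only for this check.

```python
import collections

text = ("a sailor went to sea sea sea "
        "to see what he could see see see "
        "but all that he could see see see "
        "was the bottom of the deep blue sea sea sea")

# Count whole words, without splitting them into characters.
word_counts = collections.Counter(text.split())

print(len(word_counts))            # 18 distinct words
print(sum(word_counts.values()))   # 33 words in total
print(word_counts.most_common(2))  # [('see', 7), ('sea', 6)]
```

The 18 distinct words match the vocabulary size printed above, and the counts for `see` and `sea` match the entries `'s e e '` and `'s e a '`.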


### Define Token Function
Define a function that splits each vocabulary word into its tokens and accumulates the frequency of each token, weighted by how often the word occurs

In [6]:
def get_tokens_and_frequencies(vocab):
  tokens = collections.defaultdict(int)
  for word, freq in vocab.items():
      word_tokens = word.split()
      for token in word_tokens:
          tokens[token] += freq
  return tokens
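
To see how the word counts carry over to token frequencies, the sketch below applies the same counting logic inline to a two-word vocabulary. `toy_vocab` is a hypothetical input made up for this example and is not the notebook's vocabulary.

```python
import collections

# Hypothetical vocabulary: 'sea' occurs 6 times, 'see' occurs 7 times.
toy_vocab = {'s e a ': 6, 's e e ': 7}

tokens = collections.defaultdict(int)
for word, freq in toy_vocab.items():
    for token in word.split():
        # Every occurrence of a token inside a word adds the word's count.
        tokens[token] += freq

print(dict(tokens))  # {'s': 13, 'e': 20, 'a': 6}
```

The token `e` reaches 20 because it appears once in `sea` (6) and twice in `see` (7 + 7).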

### Execute get_tokens_and_frequencies
Execute get_tokens_and_frequencies to compute a dictionary that maps each token to its frequency

In [7]:
tokens = get_tokens_and_frequencies(vocab)

### Print Tokens and Frequencies
Print a list of all the computed tokens as well as each of their frequencies

In [8]:
print('Tokens: {}'.format(tokens))
print('Number of tokens: {}'.format(len(tokens)))

Tokens: defaultdict(<class 'int'>, {'a': 12, 's': 15, 'i': 1, 'l': 6, 'o': 8, 'r': 1, 'w': 3, 'e': 28, 'n': 1, 't': 11, 'h': 6, 'c': 2, 'u': 4, 'd': 3, 'b': 3, 'm': 1, 'f': 1, 'p': 1})
Number of tokens: 18


### Define Pair Function
Define a function that computes all the pairs of adjacent letters within each vocabulary word and their frequencies

In [9]:
def get_pairs_and_counts(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs
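
The pair counting can be traced by hand on a small input. The sketch below repeats the same logic inline on a hypothetical two-word vocabulary (`toy_vocab` is made up for this example) and uses `zip` to walk over neighbouring symbols, which is an equivalent way of writing the index loop.

```python
import collections

# Hypothetical vocabulary: 'sea' occurs 6 times, 'see' occurs 7 times.
toy_vocab = {'s e a ': 6, 's e e ': 7}

pairs = collections.defaultdict(int)
for word, freq in toy_vocab.items():
    symbols = word.split()
    # zip pairs each symbol with its right-hand neighbour in the same word.
    for left, right in zip(symbols, symbols[1:]):
        pairs[left, right] += freq

print(dict(pairs))  # {('s', 'e'): 13, ('e', 'a'): 6, ('e', 'e'): 7}
```

Pairs are formed only inside a word, so the last letter of one word is never paired with the first letter of the next.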

### Execute get_pairs_and_counts
Execute get_pairs_and_counts to compute all the letter pairs and each of their frequencies

In [10]:
pairs = get_pairs_and_counts(vocab)

### Print Letter Pairs and Frequencies
Print all the letter pairs and each of their frequencies

In [11]:
print('Pairs: {}'.format(pairs))
print('Number of distinct pairs: {}'.format(len(pairs)))

Pairs: defaultdict(<class 'int'>, {('s', 'a'): 1, ('a', 'i'): 1, ('i', 'l'): 1, ('l', 'o'): 1, ('o', 'r'): 1, ('w', 'e'): 1, ('e', 'n'): 1, ('n', 't'): 1, ('t', 'o'): 3, ('s', 'e'): 13, ('e', 'a'): 6, ('e', 'e'): 8, ('w', 'h'): 1, ('h', 'a'): 2, ('a', 't'): 2, ('h', 'e'): 4, ('c', 'o'): 2, ('o', 'u'): 2, ('u', 'l'): 2, ('l', 'd'): 2, ('b', 'u'): 1, ('u', 't'): 1, ('a', 'l'): 1, ('l', 'l'): 1, ('t', 'h'): 3, ('w', 'a'): 1, ('a', 's'): 1, ('b', 'o'): 1, ('o', 't'): 1, ('t', 't'): 1, ('o', 'm'): 1, ('o', 'f'): 1, ('d', 'e'): 1, ('e', 'p'): 1, ('b', 'l'): 1, ('l', 'u'): 1, ('u', 'e'): 1})
Number of distinct pairs: 37


### Compute and Print Most Frequent Letter Pair
Compute and print the most frequent letter pair out of all the letter pairs that were computed

In [12]:
most_frequent_pair = max(pairs, key=pairs.get)
print('Most frequent pair: {}'.format(most_frequent_pair))

Most frequent pair: ('s', 'e')
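
When several pairs share the highest count, `max` returns the first one it encounters, so the result depends on the insertion order of the dictionary. The sketch below shows this on a hypothetical set of counts. The values in `pairs` are made up for the example and are not taken from the notebook.

```python
# Hypothetical counts with a tie for first place between ('h', 'e') and ('e', 'e').
pairs = {('t', 'o'): 3, ('h', 'e'): 4, ('t', 'h'): 3, ('e', 'e'): 4}

# max returns the first key that reaches the maximum count.
best = max(pairs, key=pairs.get)
print(best)  # ('h', 'e')

# A stable sort by descending count keeps tied pairs in insertion order.
ranked = sorted(pairs.items(), key=lambda item: item[1], reverse=True)
print(ranked)
# [(('h', 'e'), 4), (('e', 'e'), 4), (('t', 'o'), 3), (('t', 'h'), 3)]
```

In this notebook the dictionary is filled in the order the words first appear in the text, so ties are resolved in favour of the pair seen earliest.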


### Define Pair Merge Function
Define a function that merges letter pairs in the vocabulary

In [13]:
def merge_pair_in_vocabulary(pair, vocab_in):
    vocab_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in vocab_in:
        word_out = p.sub(''.join(pair), word)
        vocab_out[word_out] = vocab_in[word]
    return vocab_out
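
The lookarounds `(?<!\S)` and `(?!\S)` make the pattern match only when the pair lines up with whole symbols. The sketch below compares the regular expression with a naive substring replacement. The words `merged_word` and `plain_word` are hypothetical inputs chosen to expose the difference.

```python
import re

pair = ('e', 'a')
merged_word = 'se a '  # 'se' is already one symbol, so 'e' is not a symbol here
plain_word = 's e a '  # 'e' and 'a' are separate neighbouring symbols

# Naive replacement also hits the 'e a' hidden inside 'se a'.
naive = merged_word.replace(' '.join(pair), ''.join(pair))

# Same pattern as merge_pair_in_vocabulary: the match must start after
# whitespace (or the string start) and end before whitespace (or the end).
pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
guarded = pattern.sub(''.join(pair), merged_word)
expected = pattern.sub(''.join(pair), plain_word)

print(repr(naive))     # 'sea '  (wrong: merges across the symbol 'se')
print(repr(guarded))   # 'se a ' (unchanged, as intended)
print(repr(expected))  # 's ea ' (the real pair is merged)
```

`re.escape` is needed because a pair may contain characters that have a special meaning in a regular expression.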

### Execute merge_pair_in_vocabulary
Execute merge_pair_in_vocabulary to merge the most frequent letter pair and update the vocabulary

In [14]:
vocab = merge_pair_in_vocabulary(most_frequent_pair, vocab)

### Print Vocabulary and Size
Print the newly updated vocabulary and the vocabulary size

In [15]:
print('Vocabulary: {}'.format(vocab))
print('Size of vocabulary: {}'.format(len(vocab)))

Vocabulary: {'a ': 1, 's a i l o r ': 1, 'w e n t ': 1, 't o ': 2, 'se a ': 6, 'se e ': 7, 'w h a t ': 1, 'h e ': 2, 'c o u l d ': 2, 'b u t ': 1, 'a l l ': 1, 't h a t ': 1, 'w a s ': 1, 't h e ': 2, 'b o t t o m ': 1, 'o f ': 1, 'd e e p ': 1, 'b l u e ': 1}
Size of vocabulary: 18


### Execute get_tokens_and_frequencies
Execute get_tokens_and_frequencies to get the new tokens and their frequencies from the updated vocabulary

In [16]:
tokens = get_tokens_and_frequencies(vocab)

### Print Tokens and Frequencies
Print all the tokens and their frequencies

In [17]:
print('Tokens: {}'.format(tokens))
print('Number of tokens: {}'.format(len(tokens)))

Tokens: defaultdict(<class 'int'>, {'a': 12, 's': 2, 'i': 1, 'l': 6, 'o': 8, 'r': 1, 'w': 3, 'e': 15, 'n': 1, 't': 11, 'se': 13, 'h': 6, 'c': 2, 'u': 4, 'd': 3, 'b': 3, 'm': 1, 'f': 1, 'p': 1})
Number of tokens: 19


### Define Tokenization Function
Define a tokenization function, which uses the previously defined functions to convert input text into tokens

In [18]:
def tokenize(text, num_merges):
  # Initialize the vocabulary from the input text
  vocab = initialize_vocabulary(text)

  # Perform up to num_merges pair merges
  for i in range(num_merges):
    # Compute the tokens and their frequency in the vocabulary
    tokens = get_tokens_and_frequencies(vocab)

    # Compute the pairs of adjacent tokens and their frequencies
    pairs = get_pairs_and_counts(vocab)

    # Stop early once every word is a single token and no pairs remain;
    # max() would raise a ValueError on an empty dictionary
    if not pairs:
      break

    # Find the most frequent pair
    most_frequent_pair = max(pairs, key=pairs.get)
    print('Most frequent pair: {}'.format(most_frequent_pair))

    # Merge the most frequent pair in the vocabulary
    vocab = merge_pair_in_vocabulary(most_frequent_pair, vocab)

  # Compute the tokens and their frequency in the vocabulary
  tokens = get_tokens_and_frequencies(vocab)

  return tokens, vocab

### Tokenize Input Text

In [19]:
tokens, vocab = tokenize(text, num_merges=22)

Most frequent pair: ('s', 'e')
Most frequent pair: ('se', 'e')
Most frequent pair: ('se', 'a')
Most frequent pair: ('h', 'e')
Most frequent pair: ('t', 'o')
Most frequent pair: ('h', 'a')
Most frequent pair: ('ha', 't')
Most frequent pair: ('c', 'o')
Most frequent pair: ('co', 'u')
Most frequent pair: ('cou', 'l')
Most frequent pair: ('coul', 'd')
Most frequent pair: ('t', 'he')
Most frequent pair: ('s', 'a')
Most frequent pair: ('sa', 'i')
Most frequent pair: ('sai', 'l')
Most frequent pair: ('sail', 'o')
Most frequent pair: ('sailo', 'r')
Most frequent pair: ('w', 'e')
Most frequent pair: ('we', 'n')
Most frequent pair: ('wen', 't')
Most frequent pair: ('w', 'hat')
Most frequent pair: ('b', 'u')


### Print Tokens, Frequency of Tokens, Vocabulary, and Size of Vocabulary

In [20]:
print('Tokens: {}'.format(tokens))
print('Number of tokens: {}'.format(len(tokens)))
print('Vocabulary: {}'.format(vocab))
print('Size of vocabulary: {}'.format(len(vocab)))

Tokens: defaultdict(<class 'int'>, {'a': 3, 'sailor': 1, 'went': 1, 'to': 3, 'sea': 6, 'see': 7, 'what': 1, 'he': 2, 'could': 2, 'bu': 1, 't': 3, 'l': 3, 'hat': 1, 'w': 1, 's': 1, 'the': 2, 'b': 2, 'o': 2, 'm': 1, 'f': 1, 'd': 1, 'e': 3, 'p': 1, 'u': 1})
Number of tokens: 24
Vocabulary: {'a ': 1, 'sailor ': 1, 'went ': 1, 'to ': 2, 'sea ': 6, 'see ': 7, 'what ': 1, 'he ': 2, 'could ': 2, 'bu t ': 1, 'a l l ': 1, 't hat ': 1, 'w a s ': 1, 'the ': 2, 'b o t to m ': 1, 'o f ': 1, 'd e e p ': 1, 'b l u e ': 1}
Size of vocabulary: 18
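
One property of the result is worth checking: merging only removes spaces between symbols, so every vocabulary key still spells its original word, and the vocabulary size stays at 18 while the token set changes. The sketch below tests this on a small input. It is a compact, self-contained variant of the steps above, not the notebook's own code: `learn_merges`, `sample`, and the returned `merges` list are names introduced for this illustration, and the early stop when no pairs remain is an added assumption.

```python
import collections
import re

def learn_merges(text, num_merges):
    # Words as space-separated characters with a trailing space, plus counts.
    vocab = collections.Counter(' '.join(w) + ' ' for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), w): f for w, f in vocab.items()}
        merges.append(best)
    return vocab, merges

sample = "sea sea see see see"  # hypothetical input for this check
vocab, merges = learn_merges(sample, num_merges=10)

print(merges)  # [('s', 'e'), ('se', 'e'), ('se', 'a')]
print(vocab)   # {'sea ': 2, 'see ': 3}

# Removing the spaces from each key gives back the original words.
spelled = {''.join(key.split()): freq for key, freq in vocab.items()}
print(spelled == dict(collections.Counter(sample.split())))  # True
```

Although ten merges are requested, only three are performed, because after the third merge no adjacent pairs are left.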
