## Load the sequence

This time, we load only a sample of the text sequence rather than the entire dataset, to keep RAM usage under control. If RAM fills up, the BPE algorithm will not run properly.

Adjust the `number_of_characters_to_read` value to find the optimal setting for your system.

In [1]:
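# Read only the first N characters; pass the encoding explicitly so the
# result does not depend on the platform default.
with open("../data/OPUSCombined.txt", "r", encoding="utf-8") as f:
    number_of_characters_to_read = 10_000_000
    text_sequence = f.read(number_of_characters_to_read)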

len(text_sequence)

10000000
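To choose a value, it can help to check how large a sample actually is once loaded. The following is a minimal sketch, not part of the original notebook: `read_sample` is a hypothetical helper, and the demo runs on a small temporary file standing in for the real dataset.

```python
import os
import sys
import tempfile


def read_sample(path, n_chars, encoding="utf-8"):
    """Read at most n_chars characters (not bytes) from a text file."""
    with open(path, "r", encoding=encoding) as f:
        return f.read(n_chars)


# Small temporary file standing in for the real dataset (assumption).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("¿Cómo estás? " * 1000)
    demo_path = tmp.name

sample = read_sample(demo_path, 500)
os.remove(demo_path)

n_chars = len(sample)                        # characters read
n_utf8_bytes = len(sample.encode("utf-8"))   # bytes a byte-level tokenizer starts from
n_ram_bytes = sys.getsizeof(sample)          # size of the Python str object itself
print(n_chars, n_utf8_bytes, n_ram_bytes)
```

Note that `f.read(n)` counts characters in text mode, so a sample with accented characters is larger in UTF-8 bytes than its character count suggests.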

## BPE algorithm

I am using the [minBPE](https://github.com/karpathy/minbpe) repository to tokenize the sequence of text.

In [2]:
import sys
sys.path.append('..')

Start by training the tokenizer on the sample loaded above, which comes from the text sequence saved in the previous notebook.

In [3]:
from minbpe import RegexTokenizer

tokenizer = RegexTokenizer()
tokenizer.train(text_sequence, vocab_size=16_384)

100%|██████████| 16128/16128 [11:45:59<00:00,  2.63s/it]  
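The progress bar counts 16,128 steps because a byte-level vocabulary starts with 256 entries and each merge adds one more (16,384 - 256 = 16,128). As a rough illustration of what each step does, here is a simplified sketch of byte-level BPE training. It is not the minBPE code: it skips the regex pre-splitting that `RegexTokenizer` performs, and the function names are my own.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn up to vocab_size - 256 merges on the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        best_pair = pairs.most_common(1)[0][0]
        ids = merge(ids, best_pair, new_id)
        merges[best_pair] = new_id
    return merges, ids


merges, ids = train_bpe("aaabdaaabac", vocab_size=259)
print(merges)
print(ids)
```

Each iteration rescans the whole sequence, which is why training on 10 million characters takes hours and why the sample size matters.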


Visualize the vocabulary.

In [4]:
vocab = tokenizer.vocab
vocab

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'S',
 ...

Test the tokenizer.

In [5]:
tokenizer.encode("Hola como estas?")

[13515, 388, 1304, 63]

In [6]:
tokenizer.decode([83, 1813, 3363, 32, 7312, 3770, 115])

'S idea cincuenta  dirigirse ruegos'
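A more systematic check is a round trip: encoding a string and decoding the result should return the original text. The sketch below assumes only that the tokenizer exposes `encode` and `decode`, as the cells above do. `ByteTokenizer` is a hypothetical stand-in so the example runs without the trained tokenizer; in the notebook you would pass `tokenizer` instead.

```python
class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte, no merges."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8", errors="replace")


def roundtrip_ok(tok, text):
    """Return True if decode(encode(text)) reproduces text exactly."""
    return tok.decode(tok.encode(text)) == text


samples = ["Hola como estas?", "¿Cómo estás?", ""]
results = {s: roundtrip_ok(ByteTokenizer(), s) for s in samples}
print(results)
```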

Add special tokens to the vocabulary. These tokens are used heavily in the fine-tuning step.

In [7]:
# Take the largest id explicitly instead of relying on dict insertion order.
max_vocab_id = max(tokenizer.vocab.keys())
tokenizer.special_tokens = {
    "<|startoftext|>": max_vocab_id + 1,
    "<|separator|>": max_vocab_id + 2,
    "<|endoftext|>": max_vocab_id + 3,
    "<|unk|>": max_vocab_id + 4,
    "<|padding|>": max_vocab_id + 5
}
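The id assignment can be sketched independently of the tokenizer. The example below uses a toy vocabulary and a hypothetical helper, `assign_special_token_ids`, to show that special tokens take the ids immediately above the highest learned id. It also builds the inverse mapping that decoding would need. If the trained vocabulary has contiguous ids 0 through 16383, the same logic yields 16384 through 16388.

```python
def assign_special_token_ids(vocab, names):
    """Give each special token the next free id after the largest vocab id."""
    next_id = max(vocab) + 1
    return {name: next_id + i for i, name in enumerate(names)}


# Toy vocabulary (assumption): 256 byte tokens plus one learned merge.
toy_vocab = {i: bytes([i]) for i in range(256)}
toy_vocab[256] = b"es"

special_tokens = assign_special_token_ids(
    toy_vocab,
    ["<|startoftext|>", "<|separator|>", "<|endoftext|>", "<|unk|>", "<|padding|>"],
)
# Inverse mapping (id -> string), needed to turn special ids back into text.
inverse_special_tokens = {v: k for k, v in special_tokens.items()}

print(special_tokens)
```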

Save the tokenizer.

In [8]:
tokenizer.save(file_prefix="../output/tokenizer/dataset_tokenizer")
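Saving is likely to fail if the target directory does not exist yet. This assumes `save` opens files named after the prefix for writing, as upstream minBPE does with `<prefix>.model` and `<prefix>.vocab`. A minimal sketch of creating the directory first, using a hypothetical helper `prepare_prefix` and a temporary directory in place of `../output`:

```python
import tempfile
from pathlib import Path


def prepare_prefix(file_prefix):
    """Create the parent directory of file_prefix and return it as a string."""
    prefix = Path(file_prefix)
    prefix.parent.mkdir(parents=True, exist_ok=True)
    return str(prefix)


# Temporary directory standing in for the project root (assumption).
with tempfile.TemporaryDirectory() as root:
    prefix = prepare_prefix(Path(root) / "output" / "tokenizer" / "dataset_tokenizer")
    parent_exists = Path(prefix).parent.is_dir()
    print(prefix, parent_exists)
```

In the notebook, the returned string could then be passed as `file_prefix` to `tokenizer.save`.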