## Load the sequence

First, load the text. If your computer does not have enough memory to hold the whole file, load only part of it.

Reading only a portion of the text helps prevent memory issues. If your computer runs out of memory, BPE training may crash or become extremely slow.

Adjust `number_of_characters_to_read` to suit your system. If memory is not a problem, drop the limit and call `f.read()` with no argument.

In [None]:
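# Change the path to point to your own text file.
number_of_characters_to_read = 10_000_000

# Read as UTF-8 explicitly so non-ASCII text decodes the same way on every platform.
with open("./data/sequence_of_text.txt", "r", encoding="utf-8") as f:
    text_sequence = f.read(number_of_characters_to_read)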

len(text_sequence)
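
Note that in text mode `f.read(n)` counts characters, not bytes, so the limit above does not map one-to-one to file size for non-ASCII text. The following is a small sketch of that behaviour; the sample string and the temporary file are made up for illustration and are not part of the dataset.

```python
import os
import tempfile

# Hypothetical sample: 4 Arabic letters (2 bytes each in UTF-8) plus a space.
sample = "سلام " * 1000  # 5,000 characters, 9,000 bytes in UTF-8

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    size_in_bytes = os.path.getsize(path)

    with open(path, "r", encoding="utf-8") as f:
        chunk = f.read(100)  # at most 100 characters, not 100 bytes

chunk_bytes = len(chunk.encode("utf-8"))
print(len(chunk), chunk_bytes, size_in_bytes)
```

For text like this, a limit of 10 million characters can therefore cover noticeably more than 10 MB of the file on disk.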

## BPE algorithm

I use the [minbpe](https://github.com/karpathy/minbpe) repository to tokenize the text. Start by training the tokenizer on the text sequence that you saved in the previous notebook and loaded above. With `vocab_size=1024`, the vocabulary holds the 256 raw byte tokens plus 768 learned merges.

In [None]:
from minbpe import RegexTokenizer

tokenizer = RegexTokenizer()
tokenizer.train(text_sequence, vocab_size=1024)
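
To give a feel for what training does, here is a minimal sketch of byte-level BPE: count adjacent pairs of token ids, merge the most frequent pair into a new id, and repeat. The function names `count_pairs`, `merge_pair`, and `train_bpe` are illustrative and are not minbpe's API. The sketch also leaves out the regex pre-splitting that `RegexTokenizer` applies before merging.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges on the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        best_pair = max(pairs, key=pairs.get)
        new_id = 256 + step  # new tokens are numbered after the byte range
        ids = merge_pair(ids, best_pair, new_id)
        merges[best_pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)
print(merges)
```

Each merge shortens the sequence and adds one entry to the vocabulary, which is why a larger `vocab_size` gives shorter token sequences at the cost of longer training.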

Inspect the vocabulary, a mapping from each token id to the bytes that token represents.

In [None]:
vocab = tokenizer.vocab
vocab

Test the tokenizer by encoding a short Darija sentence and then decoding it; the decoded text should match the input.

In [None]:
tokens = tokenizer.encode("السلام لاباس؟")
tokens

In [None]:
tokenizer.decode(tokens)

Add special tokens to the vocabulary. They will be used heavily in the fine-tuning step.

In [7]:
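# Special tokens take the ids right after the largest id in the trained vocabulary.
max_vocab_id = max(tokenizer.vocab)
# register_special_tokens also builds the inverse mapping that decode() relies on.
tokenizer.register_special_tokens({
    "<|startoftext|>": max_vocab_id + 1,
    "<|separator|>": max_vocab_id + 2,
    "<|endoftext|>": max_vocab_id + 3,
    "<|unk|>": max_vocab_id + 4,
    "<|padding|>": max_vocab_id + 5
})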
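
The id arithmetic can be checked in isolation. The sketch below uses a plain dictionary as a stand-in for `tokenizer.vocab`, assuming training produced ids 0 to 1023; the helper `assign_special_token_ids` is hypothetical and not part of minbpe.

```python
def assign_special_token_ids(vocab, special_tokens):
    """Map each special token to the next free id after the largest id in `vocab`."""
    first_free_id = max(vocab) + 1
    return {
        token: first_free_id + offset
        for offset, token in enumerate(special_tokens)
    }


# Stand-in for tokenizer.vocab after training with vocab_size=1024 (ids 0..1023).
toy_vocab = {token_id: b"" for token_id in range(1024)}

special_token_ids = assign_special_token_ids(
    toy_vocab,
    ["<|startoftext|>", "<|separator|>", "<|endoftext|>", "<|unk|>", "<|padding|>"],
)

# No special id may collide with a learned token id.
assert not set(special_token_ids.values()) & set(toy_vocab)
print(special_token_ids)
```

Under that assumption the five special tokens occupy ids 1024 to 1028, so a model trained on these tokens needs an embedding table with at least 1029 rows.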

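Save the tokenizer. minbpe writes two files, `darija_tokenizer.model` and `darija_tokenizer.vocab`, under `./output/base`, so make sure that directory exists first.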

In [None]:
tokenizer.save(file_prefix="./output/base/darija_tokenizer")