One of the most important parts of building a new transformer model is creating its tokenizer. The tokenizer is our translator from human-readable text to transformer-readable tokens. In this article, we will learn how to build our own transformer tokenizer.

In this tutorial we will use the multilingual OSCAR dataset from Hugging Face to train a tokenizer.

In [None]:
# !pip install datasets
# !pip install tokenizers

### Prepare dataset

In [None]:
import datasets

In [None]:
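# note: datasets.list_datasets() is deprecated in recent releases of `datasets`;
# huggingface_hub.list_datasets() is the suggested replacement there
all_ds = datasets.list_datasets()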
print("total number of datasets", len(all_ds))

In [None]:
all_ds[:10]

In [None]:
# select the OSCAR dataset
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_la')


In [None]:
dataset

In [None]:
# First record
# The Latin subset contains 18.8K samples (see the dataset summary above);
# each sample is a dictionary containing an id and text.
dataset['train'][0]

We will store all of our samples in plain text files, separating each sample with a newline character.
We will split the data into files of 5K samples each and save them in a new oscar_la directory. Chunking is not necessary for a dataset of this size, but it is required for large datasets.

In [None]:
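import os
from tqdm.auto import tqdm

# make sure the output directory exists before writing to it
os.makedirs('data/oscar_la', exist_ok=True)

text_data = []
file_count = 0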

for sample in tqdm(dataset['train']):
    # remove newline characters from each sample, since newlines are used exclusively as separators
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 5_000:
        # once we hit the 5K mark, save to file
        with open(f'data/oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# after saving in 5K chunks, we will have ~3808 leftover samples, so we save those too
if text_data:
    with open(f'data/oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(text_data))
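The chunking logic above can also be factored into a small reusable helper. The following is only a sketch: `chunked` is a hypothetical helper name, and the toy samples stand in for `dataset['train']` so the cell runs without downloading anything.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(samples: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` samples (hypothetical helper)."""
    iterator = iter(samples)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# toy stand-in for dataset['train']: 12 samples with embedded newlines
toy_samples = [f"sample {i}\nsecond line" for i in range(12)]

# same cleaning step as above: strip newlines so they can act as separators
cleaned = (s.replace('\n', '') for s in toy_samples)

chunks = list(chunked(cleaned, 5))
print([len(c) for c in chunks])  # [5, 5, 2]: two full chunks plus the remainder
```

Because the helper is a generator, only one chunk is held in memory at a time, which is the property that matters for large datasets.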

### Train a Latin RoBERTa Tokenizer

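To train the tokenizer we use byte-pair encoding (BPE). BPE is not specific to NLP; it originated as a data compression algorithm that repeatedly replaces the most frequent pair of adjacent symbols with a single new symbol.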
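To make the merge idea concrete, here is a toy sketch of BPE on characters. It is a simplified illustration, not the implementation used by the tokenizers library: the function names are made up for this example, and real trainers also weight words by frequency and apply pre-tokenization.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the top one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def toy_bpe(words, num_merges, min_frequency=2):
    """Learn up to `num_merges` merges, skipping pairs rarer than `min_frequency`."""
    sequences = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair, freq = most_frequent_pair(sequences)
        if pair is None or freq < min_frequency:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


merges, sequences = toy_bpe(["amo", "amas", "amat", "amamus"], num_merges=10)
print(merges)     # [('a', 'm'), ('am', 'a')]
print(sequences)  # [['am', 'o'], ['ama', 's'], ['ama', 't'], ['am', 'am', 'u', 's']]
```

Training stops after two merges here because every remaining pair occurs only once, which is below the minimum frequency of 2.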

Byte-level encoding means we build the tokenizer vocabulary from an alphabet of bytes. As a result, every word can be decomposed into tokens, including words never seen during training, so the tokenizer never needs to fall back on a special unknown token.
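A quick way to see why a byte alphabet covers everything is to look at the UTF-8 bytes of a string. This sketch uses only the Python standard library and an arbitrary example sentence; it illustrates the idea and is not how the tokenizer stores its alphabet internally.

```python
# any text, including accented letters and emoji, is a sequence of UTF-8 bytes
text = "Salvē, munde! 🙂"
byte_ids = list(text.encode("utf-8"))

# every value falls in the range 0..255, so 256 base symbols are enough
print(byte_ids)
print(len(text), "characters ->", len(byte_ids), "bytes")

# the mapping is lossless: decoding the bytes gives back the original text
restored = bytes(byte_ids).decode("utf-8")
print(restored == text)
```

Characters outside ASCII simply take more than one byte, so unseen characters cost extra tokens rather than producing an unknown token.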

We need a list of files to feed into our tokenizer's training process, so we list all .txt files in our oscar_la directory.

In [None]:
# list files
from pathlib import Path
paths = [str(x) for x in Path('data/oscar_la').glob('**/*.txt')]
paths

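We use the RoBERTa special tokens, a vocabulary size of 30522 tokens, and a minimum frequency of 2 (the minimum number of times a pair must appear in the data before it is merged into a new token). The `<unk>` token is included to match the RoBERTa convention, even though a byte-level tokenizer should not need it.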

In [None]:
from tokenizers import ByteLevelBPETokenizer
import os

# initialize
tokenizer = ByteLevelBPETokenizer()

# and train
tokenizer.train(files=paths,
                vocab_size=30_522,
                min_frequency=2,  # minimum number of times a pair must occur to be merged
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

In [None]:
# save tokenizer
tk_path = 'models/bertius'
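# create the directory (and its parent) if needed; do not fail if it already exists
os.makedirs(tk_path, exist_ok=True)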

tokenizer.save_model(tk_path)

### Using the Tokenizer

In [None]:
from transformers import RobertaTokenizerFast

tk_path = 'models/bertius'
tokenizer = RobertaTokenizerFast.from_pretrained(tk_path)

In [None]:
# example
lorem_ipsum = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor "
    "incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud "
    "exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute "
    "irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
    "pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia "
    "deserunt mollit anim id est laborum."
)

# use a shorter excerpt for the example (this overrides the full passage above)
lorem_ipsum = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor "
    "incididunt ut labore et dolore magna aliqua."
)

In [None]:
# we'll include the typical padding/truncation
lorem_tk = tokenizer(lorem_ipsum, 
                     max_length=512, 
                     padding='max_length', 
                     truncation=True)


In [None]:
lorem_tk.keys()

Here we can see our two outputs, input_ids and attention_mask. They are plain Python lists rather than tensors, since we did not pass return_tensors. In input_ids, the start-of-sequence token `<s>` is represented by 0, the end-of-sequence token `</s>` by 2, and the padding token `<pad>` by 1.

In [None]:
lorem_tk
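The relationship between input_ids and attention_mask can be sketched in plain Python. This is an illustration of the idea only, not the transformers implementation: `pad_and_mask` is a hypothetical helper, the token IDs in the example are made up, and the special-token IDs assume the ordering used when training above.

```python
# IDs of <s>, <pad>, </s>, assuming the special_tokens order used in training
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def pad_and_mask(token_ids, max_length):
    """Wrap token IDs in <s> ... </s>, truncate, then pad to max_length."""
    # leave room for the two special tokens when truncating
    body = token_ids[: max_length - 2]
    input_ids = [BOS_ID] + body + [EOS_ID]
    # 1 marks real tokens the model should attend to
    attention_mask = [1] * len(input_ids)
    # 0 marks padding positions the model should ignore
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_mask([57, 812, 94], max_length=8)  # made-up token IDs
print(encoded["input_ids"])       # [0, 57, 812, 94, 2, 1, 1, 1]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

With max_length=512, as in the call above, a short sentence produces a long run of 1s in input_ids and matching 0s in attention_mask.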

## Training a transformer from scratch
1. [Implementing the Transformer Encoder from Scratch in TensorFlow and Keras](https://machinelearningmastery.com/joining-the-transformer-encoder-and-decoder-and-masking)
2. [Implementing the Transformer Decoder from Scratch in TensorFlow and Keras](https://machinelearningmastery.com/implementing-the-transformer-decoder-from-scratch-in-tensorflow-and-keras/)
3. [Training the Transformer Model](https://machinelearningmastery.com/training-the-transformer-model/)
4. [Building Transformer Models with Attention](https://machinelearningmastery.com/building-transformer-models-with-attention-crash-course-build-a-neural-machine-translator-in-12-days/)
5. [Training Compact Transformers from Scratch in 30 Minutes with PyTorch](https://medium.com/pytorch/training-compact-transformers-from-scratch-in-30-minutes-with-pytorch-ff5c21668ed5)

# Assignment 2
### Due: August 27, 2023
In this assignment, you will construct a language model for any of the African languages in [CC-100](https://data.statmt.org/cc-100/) using the Hugging Face platform. Your submission should do the following:
1. Follow the appropriate machine learning paradigm to train your model.
2. Define or use an appropriate metric to evaluate the performance of the language model.
3. Perform error analysis and give reasons for the limitations of your model, if any.

Data source: [CC-100: Monolingual Datasets from Web Crawl Data](https://data.statmt.org/cc-100/)
