# Tokenization basics

In [1]:
from transformers import AutoTokenizer 

model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)


In [2]:
encoding = tokenizer.encode("Hi how are you?") # numericalize

In [3]:
encoding

[101, 7632, 2129, 2024, 2017, 1029, 102]

A token is an atomic piece of text. The two extremes of tokenization are character and word tokenization. With character tokenization, the vocabulary is small but the splits are too fine-grained, which leads to very long tokenized sequences. Individual characters also carry too little meaning to give the model a useful representation to springboard from. With word tokenization, you get meaningful units, but the vocabulary becomes too large in terms of the embedding parameters you would need, because words come with declensions, punctuation, misspellings, etc. You can restrict the set of words in your vocabulary, but then you're stuck with figuring out how to handle words outside the vocabulary. The most popular form of tokenization is the middle ground: sub-word tokenization. The best way to break text down into its component sub-words is learned from a relevant text corpus.
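To make the two extremes concrete, here is a minimal sketch that tokenizes one sentence by characters and by whitespace-separated words. The sample sentence and the naive `split()` word tokenizer are assumptions for illustration only; real word tokenizers handle punctuation more carefully.

```python
text = "Tokenization splits text into tokens. Tokenizers differ in how they split!"

# Character tokenization: every character is a token.
char_tokens = list(text)
char_vocab = sorted(set(char_tokens))

# Naive word tokenization: split on whitespace, so punctuation and
# capitalization create distinct vocabulary entries ("tokens." vs "tokens").
word_tokens = text.split()
word_vocab = sorted(set(word_tokens))

print("characters:", len(char_tokens), "tokens,", len(char_vocab), "vocabulary entries")
print("words:     ", len(word_tokens), "tokens,", len(word_vocab), "vocabulary entries")
```

On a single sentence the character vocabulary is not smaller than the word vocabulary. The difference shows up at corpus scale, where the character vocabulary stays bounded while the word vocabulary keeps growing.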

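Side question: are token IDs ordinal or nominal categorical data? They are nominal: an ID is just an index into the vocabulary, and its numeric order carries no meaning for the model.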

I haven't found a great way yet to find out which type of tokenizer you're using. For now, let's just look at the docstring:

In [10]:
print(tokenizer.__doc__)


    Construct a "fast" BERT tokenizer (backed by HuggingFace's *tokenizers* library). Based on WordPiece.

    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            File containing the vocabulary.
        do_lower_case (`bool`, *optional*, defaults to `True`):
            Whether or not to lowercase the input when tokenizing.
        unk_token (`str`, *optional*, defaults to `"[UNK]"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        sep_token (`str`, *optional*, defaults to `"[SEP]"`):
            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
            sequence classification or for a text and a question for question answering. It is al

# What's happening when you load a tokenizer?


In [13]:
tokens = tokenizer.convert_ids_to_tokens(encoding)
print(tokens)
tokenizer.convert_tokens_to_string(tokens)

['[CLS]', 'hi', 'how', 'are', 'you', '?', '[SEP]']


'[CLS] hi how are you? [SEP]'

Embedding matrix: each token has a token ID, and the embedding for that token is retrieved by looking up the corresponding row in the embedding matrix of the model.
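The lookup can be sketched in plain Python, assuming the matrix is stored with one row per token ID (the layout PyTorch's `nn.Embedding` uses). The sizes and token IDs below are made up for illustration and are not BERT's.

```python
import random

vocab_size, hidden_size = 8, 4  # toy sizes, chosen for illustration

random.seed(0)
# One row per token ID; each row is that token's embedding vector.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(hidden_size)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Return one embedding row per token ID."""
    return [embedding_matrix[token_id] for token_id in token_ids]

token_ids = [1, 5, 5, 2]  # hypothetical IDs, not real BERT IDs
vectors = embed(token_ids)
print(len(vectors), "vectors of size", len(vectors[0]))
```

The same token ID always maps to the same vector, so the two occurrences of ID 5 share one embedding.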

To be specific, the tokenizer's `__call__()` boils down to a call to the `encode_plus()` function (or its batched version).

Okay, but how exactly is a full piece of text split into tokens? The exact algorithm depends on the tokenization scheme. With BPE, you first split the text into characters, and then you apply whichever merges are applicable, going through the list of merges recorded during training in order. (This is why, in HF, you will see a "merges" entry in the tokenizer files of BPE models.)
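Here is a minimal sketch of applying BPE merges at tokenization time. It assumes the merges are given as an ordered list of pairs, with earlier entries taking priority. The `merges` table is hypothetical, and real implementations add pre-tokenization, byte-level handling, and caching on top of this.

```python
def apply_bpe(word, merges):
    """Tokenize one word with an ordered list of BPE merges."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair that was learned earliest during training.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no applicable merge left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge table
print(apply_bpe("lower", merges))
```

A word with no applicable merges simply stays split into characters.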

SentencePiece has four components: Normalizer, Trainer, Encoder, and Decoder. From the paper:

"Normalizer is a module to normalize semantically-equivalent Unicode characters into canonical forms. Trainer trains the subword segmentation model from the normalized corpus. We specify a type of subword model as the parameter of Trainer. Encoder internally executes Normalizer to normalize the input text and tokenizes it into a subword sequence with the subword model trained by Trainer. Decoder converts the subword sequence into the normalized text."

Most modern tokenization schemes implement lossless tokenization: they preserve information about separating characters (such as whitespace) when present, so the normalized text can be recovered from the tokens.
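A small sketch of the lossless idea, in the spirit of SentencePiece's whitespace escaping: spaces are replaced by a visible marker before splitting, so decoding is just concatenation. The segmentation rule here (split before each marker) is a toy assumption, not a trained subword model, and the sketch assumes the input does not already contain the marker character.

```python
import re

SPACE = "\u2581"  # the meta symbol SentencePiece uses to escape whitespace

def encode(text):
    """Escape spaces, then split before each escaped space (toy segmentation)."""
    escaped = text.replace(" ", SPACE)
    return re.findall(f"{SPACE}?[^{SPACE}]+|{SPACE}", escaped)

def decode(pieces):
    """Concatenate the pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(SPACE, " ")

pieces = encode("Hi how are you?")
print(pieces)
print(decode(pieces))
```

Because the marker travels with the tokens, even repeated spaces survive a round trip.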

Training a new tokenizer from an old one: https://huggingface.co/learn/nlp-course/chapter6/2

"Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm."
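As one concrete instance of that statistical process, here is a minimal sketch of BPE training: repeatedly count adjacent symbol pairs and merge the most frequent one. The tiny corpus is made up, and ties are broken by first occurrence, which is an assumption of this sketch rather than a rule of any particular library.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

corpus = ["low", "low", "lower", "lowest", "newer", "wider"]  # toy corpus
print(train_bpe(corpus, 3))
```

The returned list is ordered, and that order is what decides merge priority when the tokenizer is later applied to new text.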

# Input preprocessing

Some of the concepts we've learned get tested as soon as we think about input preprocessing. Let's take the example of fine-tuning a causal language model. Suppose that we want a mixture of two datasets: one for instruction-following (OpenOrca) and another for causal language modelling (RefinedWeb). How does one preprocess these different datasets?
1. For instruction-following, here are two examples:

We want our model to provide an answer to the question and stop right there. That's the key part of instruction-following. Remember that causal language models are trained to provide completions that are similar to the input text. So if you give a base model an instruction with few-shot examples, it might give you an answer, but if you let it continue generating, it will generate more pairs of examples. That is why we want to explicitly add an EOS token at the end of each example.

2. For causal language modelling, we want the model to simply provide an appropriate completion for our dataset. These datasets are usually just large text documents taken from the internet, and you often need to chunk them to restrict the sequence length. Here's a key detail: with text chunking, you do not want to add an EOS token at the end of every sequence, because the text does not actually finish at the boundaries of our sequences, i.e. if you let the model generate a few more tokens, we want it to keep going. The boundaries are self-constructed and do not reflect an actual end of sequence in the input documents.
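The two preprocessing rules above can be sketched on plain lists of token IDs. Everything here is an assumption for illustration: the function names are hypothetical, `EOS_ID = 2` is a placeholder (in practice you would read it from `tokenizer.eos_token_id`), and the token IDs are made up.

```python
EOS_ID = 2  # hypothetical EOS token ID, not tied to any real tokenizer

def build_instruction_example(prompt_ids, answer_ids, eos_id=EOS_ID):
    """Instruction-following: end the answer with EOS so the model learns to stop."""
    return prompt_ids + answer_ids + [eos_id]

def chunk_document(token_ids, max_length):
    """Causal LM: split a long document into chunks without adding EOS at chunk boundaries."""
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]

example = build_instruction_example([10, 11], [12])
chunks = chunk_document(list(range(100, 110)), max_length=4)
print(example)
print(chunks)
```

Whether to mark the true end of each document before chunking is a separate choice that this sketch leaves out.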

# Going for a new language

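First of all, how does a tokenizer that has never seen a character encode it? Such characters are out-of-vocabulary (OOV) tokens. With the BERT tokenizer above, a token that is not in the vocabulary is replaced by the unknown token, [UNK].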
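To see how an unseen character ends up as an unknown token, here is a sketch of greedy longest-match-first segmentation in the style of WordPiece, the scheme BERT's tokenizer is based on. The toy vocabulary is hypothetical, and the sketch covers a single word only (no normalization or pre-tokenization).

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # one unmatched part makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces

vocab = {"token", "##ize", "##r", "##s", "hi"}  # hypothetical toy vocabulary
print(wordpiece("tokenizers", vocab))
print(wordpiece("tokenß", vocab))
```

Note that a single character missing from the vocabulary is enough to turn the entire word into the unknown token, even when the rest of the word would match.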

# Alternatives to subword tokenization

Typos, variants in spelling and capitalization, and morphological changes can all cause the token representation of a word or phrase to change completely, which can result in mispredictions.

## ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
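A byte-level "tokenizer" needs almost no machinery, which a short sketch can show. It assumes the first three IDs are reserved for special tokens, so each UTF-8 byte is shifted by 3. Treat that offset and the function names as assumptions of this sketch rather than a specification of ByT5.

```python
NUM_SPECIAL = 3  # assumed: IDs 0-2 are reserved for special tokens

def byte_encode(text):
    """Map text to IDs: one ID per UTF-8 byte, shifted past the special tokens."""
    return [byte + NUM_SPECIAL for byte in text.encode("utf-8")]

def byte_decode(ids):
    """Drop special-token IDs, undo the shift, and decode the bytes as UTF-8."""
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode("utf-8", errors="ignore")

ids = byte_encode("héllo 你好")
print(len(ids), "IDs for", len("héllo 你好"), "characters")
print(byte_decode(ids))
```

There is no out-of-vocabulary case, since any string is a sequence of bytes. The cost is longer sequences, because non-ASCII characters take several bytes each.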

# Training a new tokenizer

When would you do this? For a new language, new characters, a new domain (as in Galactica), or a new style (e.g. language from a different century).


BERT example, as in the NLP course: some words in a different language are split into many tokens or get assigned an UNK token.



## When would I need to do this?

![Alt text](image.png)


![Alt text](image-1.png)

1. Normalization: cleaning (lowercasing, removing accents, etc.)
2. Pre-tokenization: splitting up into words
3. Model: tokenize
4. Post-processing: add special tokens
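The four stages listed above can be sketched end to end in plain Python. The function names, the toy vocabulary, and the whole-word lookup used for the model step are all assumptions for illustration; a real tokenizer would split words into sub-words at the model step.

```python
import re
import unicodedata

def normalize(text):
    """Normalization: lowercase and strip accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def pre_tokenize(text):
    """Pre-tokenization: split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def model(words, vocab, unk_token="[UNK]"):
    """Model: toy whole-word lookup standing in for a sub-word algorithm."""
    return [word if word in vocab else unk_token for word in words]

def post_process(tokens):
    """Post-processing: add BERT-style special tokens."""
    return ["[CLS]"] + tokens + ["[SEP]"]

vocab = {"hi", "how", "are", "you", "?"}  # hypothetical toy vocabulary
tokens = post_process(model(pre_tokenize(normalize("Hí how are you?")), vocab))
print(tokens)
```

Keeping the stages as separate functions mirrors how each one can be swapped independently, for example a different normalizer for a cased model.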

![Alt text](image-2.png)

![Alt text](image-3.png)

BPE: https://leimao.github.io/blog/Byte-Pair-Encoding/ 