# Exploring the RoBERTa tokenizer 

First, let's install the transformers library and import the necessary modules:

In [1]:
%%capture
!pip install transformers==3.5.1

In [1]:
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer


Download and load the pre-trained RoBERTa model:

In [2]:
model = RobertaModel.from_pretrained('roberta-base')


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Let us check the configuration of our RoBERTa model: 

In [3]:
model.config

RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.27.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
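Several of these values are related to each other, and a little arithmetic makes that visible. The sketch below uses a plain dictionary with numbers copied by hand from the output above, so it runs without downloading anything. The variable names are ours, and the explanation of the two reserved position slots is our understanding of how RoBERTa offsets position ids by the padding index, not something stated in the output itself.

```python
# Values copied by hand from the configuration printed above.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 514,
    "pad_token_id": 1,
    "vocab_size": 50265,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward layer expands the hidden vector by this factor.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

# Assumption: position ids start at pad_token_id + 1, so that many slots
# are never used for real tokens.
max_tokens = config["max_position_embeddings"] - (config["pad_token_id"] + 1)

# Size of the token embedding matrix (vocabulary rows x hidden columns).
embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_dim, ffn_ratio, max_tokens, embedding_params)
```

Under these assumptions, each head has 64 dimensions, the feed-forward layer is 4 times wider than the hidden size, and the longest input is 512 tokens.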


Now, let's download and load the RoBERTa tokenizer:


In [4]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")



Let's tokenize the sentence "It was a great day" using the RoBERTa tokenizer:

In [5]:
tokenizer.tokenize('It was a great day')

['It', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday']



We can see that the sentence is tokenized, but what is the character Ġ? It marks a space. The RoBERTa tokenizer replaces every white space with the character Ġ, so a token that starts with Ġ was preceded by a space in the original text. Ġ appears in front of every token except the first because the first word, "It", has no white space before it. Let us tokenize the same sentence with an additional white space at the front and see the results:


In [6]:
tokenizer.tokenize(' It was a great day')

['ĠIt', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday']



As we can observe, since we have added a white space in front of the first token, all the tokens are now preceded by the character Ġ.
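Where does Ġ come from? Byte-level BPE works on bytes, and bytes that are not printable (including the space, byte 32) are shown as other visible characters. The sketch below re-implements, using only the standard library, the byte-to-character table used by GPT-2 style tokenizers, which RoBERTa's tokenizer follows. The helper names `to_visible` and `from_visible` are ours, and this is an illustration of the idea rather than the library's own code.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible character."""
    # Bytes that are already printable, non-space characters keep their code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Every other byte is moved to a code point starting at 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_visible(text):
    """Show text the way the tokenizer sees it before splitting into tokens."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_visible(visible):
    """Invert to_visible: turn the visible characters back into plain text."""
    return bytes(CHAR_TO_BYTE[c] for c in visible).decode("utf-8")


print(BYTE_TO_CHAR[ord(" ")])             # the character that stands for a space
print(to_visible(" It was a great day"))

# Joining the tokens from the cell above and inverting the mapping
# recovers the original sentence, including the leading space.
tokens = ['ĠIt', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday']
print(repr(from_visible("".join(tokens))))
```

Because the mapping is reversible, no information about white space is lost during tokenization, which is why the leading space in the second example shows up as a Ġ on the first token.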

Consider a different example. Let's tokenize the sentence "I had a sudden epiphany":


In [7]:
tokenizer.tokenize('I had a sudden epiphany')

['I', 'Ġhad', 'Ġa', 'Ġsudden', 'Ġep', 'iphany']


From the results, we can observe that the word epiphany is split into the subwords ep and iphany. We can also notice how the white spaces are replaced with the character Ġ.
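To see how a split such as ep and iphany can arise, here is a minimal sketch of how BPE applies a list of merge rules to a word. The merge list `toy_merges` is hypothetical and chosen by hand so that the toy example reproduces the split above; the real tokenizer uses the much longer list learned during training (the merges.txt file downloaded earlier) and operates on byte-level characters.

```python
def apply_bpe(word, merges):
    """Greedily apply merge rules, always picking the highest-priority pair."""
    # Earlier rules in the list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no remaining pair has a merge rule
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, for illustration only.
toy_merges = [
    ("e", "p"),
    ("p", "h"),
    ("a", "n"),
    ("an", "y"),
    ("ph", "any"),
    ("i", "phany"),
]

print(apply_bpe("epiphany", toy_merges))
```

With these toy rules there is no rule that merges ep with iphany, so the word stays split into two subwords. A frequent word such as "great" survives as a single token because the learned merges can build it up completely.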

To summarize, RoBERTa is a variant of BERT that uses only the masked language modeling task for training. Unlike BERT, it uses dynamic masking instead of static masking, and it is trained with a large batch size. It uses byte-level BPE as its tokenizer and has a vocabulary of about 50K tokens (50,265 for roberta-base, as shown in the configuration above).
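The difference between static and dynamic masking can be illustrated with a small standard-library sketch. It is a simplification under stated assumptions: every selected token is replaced with `<mask>` (the full training recipe also sometimes keeps the token or substitutes a random one), the helper `mask_tokens` is ours, and the masking probability is raised above the usual 15% so that the effect is visible on a six-token sentence.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with <mask> independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = ['I', 'Ġhad', 'Ġa', 'Ġsudden', 'Ġep', 'iphany']

# Static masking: the pattern is sampled once during preprocessing
# and the same masked sequence is reused in every epoch.
static_rng = random.Random(0)
static_masked = mask_tokens(tokens, static_rng, mask_prob=0.3)
static_epochs = [static_masked for _ in range(3)]

# Dynamic masking: a new pattern is sampled each time the sequence
# is fed to the model, so different epochs can see different masks.
dynamic_rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, dynamic_rng, mask_prob=0.3) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(epoch, "static: ", s)
    print(epoch, "dynamic:", d)
```

The static rows are identical by construction, while the dynamic rows are drawn independently and will usually differ from one epoch to the next.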

Now that we have learned how RoBERTa works, in the next section, let's look into another interesting variant of BERT called ELECTRA.

