In [1]:
from transformers import AutoTokenizer
model_type = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_type)

A tokenizer encodes a string into token ids and decodes token ids back into a string.
Tokenizers are model-specific, so we must use the same tokenizer that was used to train the model.
A commonly used attribute of the tokenizer is:
* vocab_size: Number of tokens in the vocabulary.
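
To make the encode/decode round trip concrete, here is a minimal sketch built on a hypothetical word-level `ToyTokenizer`. It is not how the GPT-2 tokenizer works internally (GPT-2 uses byte-level BPE); the class name and the five-word vocabulary are invented purely for illustration.

```python
class ToyTokenizer:
    """Hypothetical word-level tokenizer, for illustration only."""

    def __init__(self, words):
        # Map each word to an integer id and keep the reverse mapping for decoding.
        self.token_to_id = {word: idx for idx, word in enumerate(words)}
        self.id_to_token = {idx: word for word, idx in self.token_to_id.items()}

    @property
    def vocab_size(self):
        # Number of distinct tokens the tokenizer knows about.
        return len(self.token_to_id)

    def encode(self, text):
        # Split on whitespace and look up the id of each word.
        return [self.token_to_id[word] for word in text.split()]

    def decode(self, ids):
        # Map ids back to words and join them with spaces.
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["hello", "world", "tokenizers", "are", "fun"])
ids = toy.encode("hello world")
print(ids)              # [0, 1]
print(toy.decode(ids))  # hello world
print(toy.vocab_size)   # 5
```

The real tokenizer exposes the same three ideas through `tokenizer.encode`, `tokenizer.decode`, and `tokenizer.vocab_size`.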


In [2]:
print(f"The tokenizer has {tokenizer.vocab_size} tokens in the vocabulary.")

The tokenizer has 50257 tokens in the vocabulary.


Special tokens:
Some special tokens are added to the vocabulary:
* bos_token: Beginning-of-sentence token. It is added at the beginning of the input sequence. The GPT-2 tokenizer does not add it automatically.
* eos_token: End-of-sentence token. When the model generates this token, generation stops.
* unk_token: Unknown token. When the tokenizer encounters text that it cannot map to any token in the vocabulary, it replaces that text with this token.
* pad_token: Padding token. Sequences in a batch must have the same length, so shorter sequences are padded with this token up to the length of the longest sequence (or up to a fixed maximum such as model_max_length).

GPT-2 defines a single special token, <|endoftext|>. The Hugging Face tokenizer reports it as bos_token, eos_token, and unk_token alike, but it does not prepend it when encoding, and it defines no padding token.

In [3]:
# eos_token (end-of-sentence token) marks the end of a sequence.
# The GPT-2 tokenizer does not add it automatically, so we append it to the text ourselves.
text = "Hello,"
print(f"The eos_token is {tokenizer.eos_token}")
print(f"The id of the eos_token is {tokenizer.eos_token_id}")
tokens = tokenizer.encode(text + tokenizer.eos_token, add_special_tokens=True)
print("The last token of the encoded text is", tokens[-1])

The eos_token is <|endoftext|>
The id of the eos_token is 50256
The last token of the encoded text is 50256
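
The role of the eos_token during generation can be sketched with a toy decoding loop. This is only an illustration under stated assumptions: `generate` and `fake_model` are hypothetical names, the "model" simply replays a scripted list of arbitrary token ids, and only the id 50256 is taken from the output above.

```python
EOS_ID = 50256  # GPT-2's eos_token_id, as printed above


def generate(prompt_ids, next_token_fn, eos_id, max_new_tokens=10):
    """Append tokens one at a time and stop as soon as EOS is produced."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token_fn(ids)
        ids.append(token)
        if token == eos_id:
            break  # the model signalled the end of the sequence
    return ids


# Hypothetical stand-in for a language model: replays a fixed script of ids.
scripted = iter([8, 9, EOS_ID, 10])


def fake_model(ids):
    return next(scripted)


output = generate([7], fake_model, EOS_ID)
print(output)  # [7, 8, 9, 50256]
```

The scripted id 10 is never consumed, because the loop exits once the end-of-sentence id appears.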


We now switch to another model whose tokenizer prepends a bos_token automatically and defines a dedicated unk_token and a padding token.

In [4]:
# bos_token (beginning-of-sentence token) is added to the beginning of the input text; this tokenizer prepends it automatically
# https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
model_type = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_type)
text = "Hello,"
print(f"The bos_token is {tokenizer.bos_token}")
print(f"The id of the bos_token is {tokenizer.bos_token_id}")
tokens = tokenizer.encode(text)
print("The first token of the encoded text is", tokens[0])

The bos_token is <s>
The id of the bos_token is 1
The first token of the encoded text is 1


In [5]:
# unk_token (unknown token) is used when a piece of text cannot be mapped to any token in the vocabulary.
# Here we append it to the text explicitly to show its id.
text = "Hello,"
print(f"The unk_token is {tokenizer.unk_token}")
print(f"The id of the unk_token is {tokenizer.unk_token_id}")
tokens = tokenizer.encode(text + tokenizer.unk_token, add_special_tokens=True)
print("The last token of the encoded text is", tokens[-1])

The unk_token is <unk>
The id of the unk_token is 0
The last token of the encoded text is 0
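
The cell above inserts the unknown token by hand. The fallback behaviour itself can be sketched with a toy word-level lookup; `toy_vocab` and `encode_with_unk` are hypothetical names, and the real tokenizer works on subword pieces rather than whole words, so it needs this fallback far less often.

```python
UNK_ID = 0  # matches the unk_token_id printed above

# Hypothetical three-entry vocabulary, for illustration only.
toy_vocab = {"<unk>": UNK_ID, "hello": 1, "world": 2}


def encode_with_unk(text, vocab, unk_id):
    # Words missing from the vocabulary fall back to the unknown-token id.
    return [vocab.get(word, unk_id) for word in text.lower().split()]


encoded = encode_with_unk("Hello brave world", toy_vocab, UNK_ID)
print(encoded)  # [1, 0, 2]
```

The word "brave" is not in the toy vocabulary, so it is replaced by id 0.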


In [6]:
# pad_token (padding token) is used to pad input sequences to the same length.
# Here we append it to the text explicitly to show its id.
text = "Hello,"
print(f"The pad_token is {tokenizer.pad_token}")
print(f"The id of the pad_token is {tokenizer.pad_token_id}")
tokens = tokenizer.encode(text + tokenizer.pad_token, add_special_tokens=True)
print("The last token of the encoded text is", tokens[-1])

The pad_token is </s>
The id of the pad_token is 2
The last token of the encoded text is 2


In [7]:
# eos_token (end-of-sentence token) marks the end of a sequence.
# This tokenizer does not add it by default, so we append it to the text ourselves.
text = "Hello,"
print(f"The eos_token is {tokenizer.eos_token}")
print(f"The id of the eos_token is {tokenizer.eos_token_id}")
tokens = tokenizer.encode(text + tokenizer.eos_token, add_special_tokens=True)
print("The last token of the encoded text is", tokens[-1])

The eos_token is </s>
The id of the eos_token is 2
The last token of the encoded text is 2


The eos_token is identical to the pad_token in this tokenizer. Reusing it is safe because padded positions are excluded through the attention mask, so the model does not attend to them, and anything that follows the end of a sequence is ignored anyway. The end-of-sentence token can therefore double as the padding token.
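
Because the padding id and the end-of-sentence id coincide here, the attention mask is what separates real tokens from padding. Below is a minimal sketch of right-padding a batch; `pad_batch` is a hypothetical helper and the token ids other than 1 (bos) and 2 (eos/pad) are arbitrary. In practice the tokenizer produces the same two lists when called with `padding=True`.

```python
PAD_ID = 2  # pad_token_id printed above; identical to eos_token_id for this tokenizer


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two already-encoded sequences, each ending with the eos id 2.
batch = [[1, 10, 11, 2], [1, 12, 2]]
input_ids, attention_mask = pad_batch(batch, PAD_ID)
print(input_ids)       # [[1, 10, 11, 2], [1, 12, 2, 2]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In the second row, the two trailing 2s look identical, but the mask shows that the first is a genuine end-of-sentence token and the second is padding.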

The special tokens of this tokenizer are defined in its tokenizer_config.json file:  
https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/blob/main/tokenizer_config.json