# Tokenizers (PyTorch)

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
!pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-1.11.0-py3-none-any.whl (264 kB)
[?25l[K     |█▎                              | 10 kB 23.7 MB/s eta 0:00:01[K     |██▌                             | 20 kB 27.2 MB/s eta 0:00:01[K     |███▊                            | 30 kB 19.7 MB/s eta 0:00:01[K     |█████                           | 40 kB 16.1 MB/s eta 0:00:01[K     |██████▏                         | 51 kB 7.7 MB/s eta 0:00:01[K     |███████▍                        | 61 kB 7.6 MB/s eta 0:00:01[K     |████████▋                       | 71 kB 7.9 MB/s eta 0:00:01[K     |██████████                      | 81 kB 8.8 MB/s eta 0:00:01[K     |███████████▏                    | 92 kB 9.3 MB/s eta 0:00:01[K     |████████████▍                   | 102 kB 7.3 MB/s eta 0:00:01[K     |█████████████▋                  | 112 kB 7.3 MB/s eta 0:00:01[K     |██████████████▉                 | 122 kB 7.3 MB/s eta 0:00:01[K     |████████████████                | 133 kB 7.3 MB/s eta 0:00:01

In [2]:
# Word-based tokenization: split the text on whitespace
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
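
Whitespace splitting only produces strings, while a model consumes integer IDs. The following is a minimal sketch of that missing step, assuming a vocabulary built from the sentence itself. The helpers `build_vocab` and `encode` and the `[UNK]` placeholder are hypothetical names chosen for illustration and are not part of the Transformers API.

```python
def build_vocab(words, unk_token="[UNK]"):
    # Reserve ID 0 for unknown words, then number each new word in order of appearance.
    vocab = {unk_token: 0}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(words, vocab, unk_token="[UNK]"):
    # Words missing from the vocabulary fall back to the unknown-token ID.
    return [vocab.get(word, vocab[unk_token]) for word in words]


word_vocab = build_vocab("Jim Henson was a puppeteer".split())
word_ids = encode("Jim Henson was a musician".split(), word_vocab)
print(word_ids)  # "musician" was never seen, so it maps to the [UNK] ID
```

Every unseen word collapses to the same ID here, which is one reason subword tokenizers such as the one loaded in the next cells are usually preferred.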


In [3]:
from transformers import BertTokenizer

# Load the tokenizer that matches the bert-base-cased checkpoint
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")


In [4]:
from transformers import AutoTokenizer

# AutoTokenizer selects the appropriate tokenizer class for the checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [5]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
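
The `##` prefix in the output above marks a piece that continues the previous word. The sketch below shows the greedy longest-match-first idea behind that split, assuming a hypothetical seven-entry `toy_vocab` rather than the real `bert-base-cased` vocabulary. The function `wordpiece_tokenize` is written for illustration only; the actual tokenizer also normalizes text and splits punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking it until it is in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # pieces inside a word carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the example sentence.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}
toy_tokens = [
    piece
    for word in "Using a Transformer network is simple".split()
    for piece in wordpiece_tokenize(word, toy_vocab)
]
print(toy_tokens)
```

With this toy vocabulary, `Transformer` is not an entry on its own, so the loop falls back to `Trans` and then `##former`.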


In [8]:
# Look up each token's vocabulary ID (no special tokens are added at this step)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
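
These IDs match the `input_ids` returned by the direct `tokenizer(...)` call earlier, except that the direct call also wraps the sequence in `101` and `102`, the IDs of BERT's `[CLS]` and `[SEP]` special tokens. A minimal sketch of both steps as a plain dictionary lookup follows, assuming a hypothetical `toy_ids` table that holds only the entries seen in this notebook's outputs.

```python
# Hypothetical lookup table holding only the entries visible in the outputs above.
toy_ids = {
    "[CLS]": 101,
    "[SEP]": 102,
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}

pieces = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

# Token-to-ID conversion is a per-token lookup.
piece_ids = [toy_ids[piece] for piece in pieces]

# A full encoding additionally adds the special tokens around the sequence.
with_special = [toy_ids["[CLS]"]] + piece_ids + [toy_ids["[SEP]"]]

print(piece_ids)
print(with_special)
```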


In [9]:
# Decode IDs back to text; subword pieces are merged into whole words
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple
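
Decoding does more than map IDs back to tokens: pieces that continue a word have to be glued onto the previous piece. The sketch below illustrates that merging step on token strings only, assuming the `##` continuation convention shown earlier. The helper `join_wordpieces` is hypothetical and is not how `tokenizer.decode` is implemented.

```python
def join_wordpieces(pieces):
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            # Continuation piece: append it to the current word without a space.
            text += piece[2:]
        elif text:
            # A new word starts: separate it from the previous one with a space.
            text += " " + piece
        else:
            text = piece
    return text


joined = join_wordpieces(["Using", "a", "Trans", "##former", "network", "is", "simple"])
print(joined)
```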
