<a href="https://colab.research.google.com/github/Harsh-2909/NLP-Colab-Notebooks/blob/main/course/chapter2/section4_pt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [2]:
# Use Python's str.split() as a simple word-based tokenizer
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
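
Whitespace splitting leaves punctuation attached to the neighboring word. As a minimal sketch (the sample sentence and the regular expression are illustrative assumptions, not part of the original notebook), a slightly more careful word-based tokenizer can separate punctuation with `re.findall`:

```python
import re

# Hypothetical sample sentence with trailing punctuation.
sample_text = "Jim Henson was a puppeteer."

# Plain whitespace splitting keeps the period glued to the last word.
split_tokens = sample_text.split()
print(split_tokens)

# Match either a run of word characters or a single punctuation character.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample_text)
print(regex_tokens)
```

Either way, a word-based tokenizer needs one vocabulary entry per distinct word, which is one reason sub-word tokenizers such as the one loaded next are used instead.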


In [6]:
from transformers import BertTokenizer, AutoTokenizer

# Either BertTokenizer or AutoTokenizer can load the tokenizer for the
# "bert-base-cased" checkpoint through the from_pretrained() method.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# print(tokenizer)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# print(tokenizer)
# Uncomment the print calls to compare the two: they behave essentially the same.
# BertTokenizer returns a BertTokenizer object, while AutoTokenizer returns a BertTokenizerFast.

In [4]:
tokenizer("Using a Transformer network is simple but gaining mastery is difficult.")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 1133, 8289, 3283, 1183, 1110, 2846, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
# save_pretrained() writes the tokenizer files to a directory of our choice.
tokenizer.save_pretrained("directory_on_my_computer")

# The Tokenization Pipeline

Here we break the tokenizer call into its individual steps to see how it works. In practice, call the tokenizer directly rather than running these steps by hand.

Tokenization proceeds in three steps:
1. The sentence is split into tokens by a tokenization algorithm, either word-based or sub-word-based (e.g. WordPiece, which BERT uses).
2. The tokens are converted to IDs using the tokenizer's vocabulary.
3. The IDs of special tokens (such as [CLS] and [SEP]) are added to the ID sequence so that it can be passed to the model.
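
The three steps above can be sketched in plain Python. This is a toy illustration only: the vocabulary, its IDs, and the `toy_*` helper names are made up for the example, and the greedy longest-match-first split is a simplified stand-in for WordPiece, not the actual `bert-base-cased` implementation.

```python
# Hypothetical toy vocabulary; these IDs do not match bert-base-cased.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "Using": 4, "a": 5, "transform": 6, "##er": 7, "network": 8,
}

def toy_wordpiece(word, vocab):
    """Split one word into sub-words by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no match at all: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab):
    """Step 1: split on whitespace, then split each word into sub-words."""
    tokens = []
    for word in text.split():
        tokens.extend(toy_wordpiece(word, vocab))
    return tokens

def toy_convert_tokens_to_ids(tokens, vocab):
    """Step 2: look up each token in the vocabulary."""
    return [vocab[token] for token in tokens]

def toy_add_special_tokens(ids, vocab):
    """Step 3: wrap the sequence in [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

toy_tokens = toy_tokenize("Using a transformer network", TOY_VOCAB)
toy_ids = toy_convert_tokens_to_ids(toy_tokens, TOY_VOCAB)
toy_input_ids = toy_add_special_tokens(toy_ids, TOY_VOCAB)
print(toy_tokens)     # ['Using', 'a', 'transform', '##er', 'network']
print(toy_ids)        # [4, 5, 6, 7, 8]
print(toy_input_ids)  # [2, 4, 5, 6, 7, 8, 3]
```

The real tokenizer also handles normalization, punctuation, and a vocabulary of tens of thousands of entries, which is why it is preferable to use it directly.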

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a transformer network is simple but resource consuming"
# Tokenizing it directly
tokens = tokenizer(sequence, return_tensors="pt")
print(tokens)

# Following the tokenization pipeline
tokens = tokenizer.tokenize(sequence)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
ids = tokenizer.prepare_for_model(ids, return_tensors="pt")
print(ids)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': tensor([[  101,  7993,   170, 11303,  1200,  2443,  1110,  3014,  1133,  9100,
         16114,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple', 'but', 'resource', 'consuming']
[7993, 170, 11303, 1200, 2443, 1110, 3014, 1133, 9100, 16114]
{'input_ids': tensor([  101,  7993,   170, 11303,  1200,  2443,  1110,  3014,  1133,  9100,
        16114,   102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}


In [13]:
# decode() converts token IDs back into a string
decoded_string = tokenizer.decode(ids['input_ids'])
# decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

[CLS] Using a transformer network is simple but resource consuming [SEP]
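
Decoding does more than look up each ID: sub-word pieces such as `transform` and `##er` are merged back into one word. A minimal sketch of that merging step, assuming the `##` continuation prefix seen in the token output above (`toy_decode` is a hypothetical helper, not a Transformers function):

```python
def toy_decode(tokens):
    """Join tokens into a string, gluing '##' pieces onto the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)     # a new word starts here
    return " ".join(words)

pieces = ["Using", "a", "transform", "##er", "network", "is", "simple"]
decoded = toy_decode(pieces)
print(decoded)  # Using a transformer network is simple
```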
