#  **Understanding Tokenizers with BERT**

This notebook shows how a BERT tokenizer turns text into tokens and token IDs, the numeric input that the model consumes.

##  Setup and Installation

First, we need to install the libraries we will use.

In [3]:
!pip install pandas==2.0.1
!pip install transformers==4.29.2

Collecting pandas==2.0.1
  Downloading pandas-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.3
    Uninstalling pandas-2.0.3:
      Successfully uninstalled pandas-2.0.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.0.1 which is incompatible.
Successfully installed pandas-2.0.1
Collecting transformers==4.29.2
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting tokenizers!=0.11.3,<0.14,


##  Importing Libraries

We import the libraries necessary for our tasks.

In [4]:
# Import required libraries
from transformers import BertModel, AutoTokenizer
import pandas as pd


##  Model Setup

We load a pre-trained BERT model and its tokenizer.

In [5]:
# Specify the pre-trained model to use: BERT-base-cased
model_name = "bert-base-cased"

# Instantiate the model and tokenizer for the specified pre-trained model

model = BertModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


##  Tokenizing Text

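We use the tokenizer to split a sentence into WordPiece tokens. A `##` prefix marks a piece that continues the previous token, so `Jo` followed by `##in` spells "Join".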

In [6]:
# Set a sentence for analysis
sentence = "Level up with the largest AI & ML community. Join over 18M+ machine learners."

In [7]:
# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)
tokens

['Level',
 'up',
 'with',
 'the',
 'largest',
 'AI',
 '&',
 'M',
 '##L',
 'community',
 '.',
 'Jo',
 '##in',
 'over',
 '18',
 '##M',
 '+',
 'machine',
 'learn',
 '##ers',
 '.']
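The splits above (`Jo` + `##in`, `learn` + `##ers`) are consistent with greedy longest-match-first lookup against the vocabulary. The following is a minimal sketch of that idea, not the library's implementation; the function name `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration only.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has 28996 entries.
toy_vocab = {"Jo", "##in", "learn", "##ers", "M", "##L", "machine"}

for w in ["Join", "learners", "ML", "machine"]:
    print(w, wordpiece(w, toy_vocab))
```

With this toy vocabulary, "Join" has no whole-word entry, so the longest matching prefix `Jo` is taken first and the remainder is matched as `##in`.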


## Vocabulary and Token IDs

We create a DataFrame to see the tokenizer's vocabulary and sort it by token IDs.

In [8]:
# Create a DataFrame with the tokenizer's vocabulary
vocab = tokenizer.vocab
vocab_df = pd.DataFrame({"token": vocab.keys(), "token_id": vocab.values()})
vocab_df = vocab_df.sort_values(by="token_id").set_index("token_id")

vocab_df

token_id,token
0,[PAD]
1,[unused1]
2,[unused2]
3,[unused3]
4,[unused4]
...,...
28991,##）
28992,##，
28993,##－
28994,##／


## Encoding and Decoding

Encode the sentence into token IDs. Decoding the IDs back to text is shown further below.

In [9]:
# Encode the sentence into token_ids using the tokenizer
token_ids = tokenizer.encode(sentence)
token_ids

[101,
 9583,
 1146,
 1114,
 1103,
 2026,
 19016,
 111,
 150,
 2162,
 1661,
 119,
 8125,
 1394,
 1166,
 1407,
 2107,
 116,
 3395,
 3858,
 1468,
 119,
 102]


## Compare Token Lengths

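Compare the number of tokens with the number of token IDs. `encode` adds the special tokens `[CLS]` and `[SEP]`, so there are two more IDs than tokens.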


In [10]:

# Print the length of tokens and token_ids
print("Number of tokens:", len(tokens))
print("Number of token IDs:", len(token_ids))


Number of tokens: 21
Number of token IDs: 23
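The difference of two can be reproduced without the model: wrapping the 21 tokens in the two special tokens gives 23 items. This is a small sketch using the token list copied from the output above; the variable name `with_special` is an assumption. In the library itself, passing `add_special_tokens=False` to `encode` should yield IDs for the 21 tokens only.

```python
# Tokens copied from the tokenizer.tokenize(sentence) output above.
tokens = ['Level', 'up', 'with', 'the', 'largest', 'AI', '&', 'M', '##L',
          'community', '.', 'Jo', '##in', 'over', '18', '##M', '+',
          'machine', 'learn', '##ers', '.']

# encode() wraps the sequence in [CLS] ... [SEP].
with_special = ["[CLS]"] + tokens + ["[SEP]"]

print("Number of tokens:", len(tokens))
print("Number of tokens with special tokens:", len(with_special))
```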



## Explore Token Data

Look at specific tokens by their IDs.

In [11]:
# Access tokens by row position; positions match token IDs because vocab_df is sorted by token_id
print("Token at position 101:", vocab_df.iloc[101])
print("Token at position 111:", vocab_df.iloc[111])
print("Token at position 19016:", vocab_df.iloc[19016])

Token at position 101: token    [CLS]
Name: 101, dtype: object
Token at position 111: token    &
Name: 111, dtype: object
Token at position 19016: token    AI
Name: 19016, dtype: object
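The same lookup by ID can be done with a plain dictionary by inverting the token-to-ID mapping. The sketch below uses only a handful of pairs copied from the outputs in this notebook; `vocab_sample` and `id_to_token` are names assumed for illustration.

```python
# A few (token, id) pairs copied from the outputs in this notebook.
vocab_sample = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "&": 111, "AI": 19016}

# Invert the mapping so that token IDs become the keys.
id_to_token = {token_id: token for token, token_id in vocab_sample.items()}

for token_id in (101, 111, 19016):
    print(token_id, id_to_token[token_id])
```

Unlike positional access, a dictionary keyed by ID does not depend on the rows being sorted.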


## Token and ID Pairing

Show pairs of tokens and their IDs.

In [12]:
# Zip tokens and token_ids (excluding the first and last token_ids for [CLS] and [SEP])
list(zip(tokens, token_ids[1:-1]))

[('Level', 9583),
 ('up', 1146),
 ('with', 1114),
 ('the', 1103),
 ('largest', 2026),
 ('AI', 19016),
 ('&', 111),
 ('M', 150),
 ('##L', 2162),
 ('community', 1661),
 ('.', 119),
 ('Jo', 8125),
 ('##in', 1394),
 ('over', 1166),
 ('18', 1407),
 ('##M', 2107),
 ('+', 116),
 ('machine', 3395),
 ('learn', 3858),
 ('##ers', 1468),
 ('.', 119)]

In [13]:
# Decode the token_ids (excluding [CLS] and [SEP]) back into text; spacing can differ from the original ("18M+" becomes "18M +")
tokenizer.decode(token_ids[1:-1])

'Level up with the largest AI & ML community. Join over 18M + machine learners.'
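The decoded string differs from the input only in spacing ("18M +"). A simplified sketch of why: the pieces are joined with spaces, `##` continuations are glued to the previous piece, and a clean-up pass removes the space before punctuation such as ".". The code below is an approximation under those assumptions, not the library's implementation, and it handles only the "." case.

```python
# Tokens copied from the tokenizer.tokenize(sentence) output above.
tokens = ['Level', 'up', 'with', 'the', 'largest', 'AI', '&', 'M', '##L',
          'community', '.', 'Jo', '##in', 'over', '18', '##M', '+',
          'machine', 'learn', '##ers', '.']

# Join with spaces, then glue "##" continuation pieces onto the previous piece.
joined = " ".join(tokens).replace(" ##", "")

# Simplified clean-up: drop the space before a period.
cleaned = joined.replace(" .", ".")

print(joined)
print(cleaned)
```

The "+" keeps its leading space because it is a separate token rather than a `##` continuation, and this clean-up does not touch it.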

In [14]:
# Tokenize the sentence using the tokenizer's `__call__` method
tokenizer_out = tokenizer(sentence)
tokenizer_out

{'input_ids': [101, 9583, 1146, 1114, 1103, 2026, 19016, 111, 150, 2162, 1661, 119, 8125, 1394, 1166, 1407, 2107, 116, 3395, 3858, 1468, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


## Handling Multiple Sentences

Tokenize two sentences together with padding, then decode each one.

In [15]:
# Create a second, longer sentence by replacing "largest" with "greatest and best"
sentence2 = sentence.replace("largest", "greatest and best")
sentence2

'Level up with the greatest and best AI & ML community. Join over 18M+ machine learners.'

In [16]:
# Tokenize both sentences with padding
tokenizer_out2 = tokenizer([sentence, sentence2], padding=True)
tokenizer_out2

{'input_ids': [[101, 9583, 1146, 1114, 1103, 2026, 19016, 111, 150, 2162, 1661, 119, 8125, 1394, 1166, 1407, 2107, 116, 3395, 3858, 1468, 119, 102, 0, 0], [101, 9583, 1146, 1114, 1103, 4459, 1105, 1436, 19016, 111, 150, 2162, 1661, 119, 8125, 1394, 1166, 1407, 2107, 116, 3395, 3858, 1468, 119, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
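In the output above, the shorter sequence is extended with ID 0 (`[PAD]`) and its attention mask is 0 at those positions. A minimal sketch of that right-padding logic follows, using the two ID lists from this notebook; `pad_batch` is a hypothetical helper, not a library function.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every ID list to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the outputs above (23 and 25 IDs respectively).
ids1 = [101, 9583, 1146, 1114, 1103, 2026, 19016, 111, 150, 2162, 1661, 119,
        8125, 1394, 1166, 1407, 2107, 116, 3395, 3858, 1468, 119, 102]
ids2 = [101, 9583, 1146, 1114, 1103, 4459, 1105, 1436, 19016, 111, 150, 2162,
        1661, 119, 8125, 1394, 1166, 1407, 2107, 116, 3395, 3858, 1468, 119, 102]

padded = pad_batch([ids1, ids2])
print(padded["input_ids"][0][-3:])
print(padded["attention_mask"][0][-3:])
```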

In [17]:
# Decode the padded input_ids of the first sentence (special tokens and padding are kept)
tokenizer.decode(tokenizer_out2["input_ids"][0])

'[CLS] Level up with the largest AI & ML community. Join over 18M + machine learners. [SEP] [PAD] [PAD]'

In [18]:
tokenizer.decode(tokenizer_out2["input_ids"][1])

'[CLS] Level up with the greatest and best AI & ML community. Join over 18M + machine learners. [SEP]'