# Tokenization with BERT: A Visual Exploration

This notebook looks at how the tokenizer for BERT (Bidirectional Encoder Representations from Transformers), specifically the one shipped with the base uncased model (`bert-base-uncased`), splits raw text into tokens and maps those tokens to the integer IDs that the model consumes.

**Step 1:** Setup and Library Imports

In [None]:
%pip install transformers

In [None]:
from transformers import AutoTokenizer
import pandas as pd

**Step 2:** Select Text Samples

In [None]:
text_samples = [
    "Artificial intelligence transforms computational paradigms with neural networks.",
    "The principles of thermodynamics underpin many engineering solutions.",
    "Blockchain technology promises to revolutionize data security in software engineering.",
    "Reinforcement learning enables autonomous systems to learn from interactions.",
    "Quantum computing could potentially solve complex problems faster than classical computers.",
    "Robotic automation in manufacturing accelerates production efficiency and precision.",
    "The integration of IoT devices enhances smart city infrastructure management.",
    "Deep learning algorithms require substantial data for model training and feature extraction.",
    "Finite element analysis is pivotal in structural engineering for stress testing.",
    "Natural language processing facilitates human-computer interaction in AI applications."
]


**Step 3:** Tokenization with a Pre-Trained Model Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

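The "uncased" in `bert-base-uncased` means the text is normalized before it is split into word pieces, which is why `IoT` and `AI` show up in lowercase in the results later on. The sketch below approximates that normalization in plain Python, assuming it amounts to lowercasing plus stripping combining accent marks. The helper name `normalize_uncased` is hypothetical and not part of `transformers`; the real tokenizer also cleans up whitespace and splits punctuation, which is not shown here.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Approximate the 'uncased' normalization: lowercase, then drop accents.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    lowered = text.lower()
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accent marks.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for sample in ["IoT devices", "AI applications", "Café"]:
    print(repr(sample), "->", repr(normalize_uncased(sample)))
```

Because of this step, an uncased tokenizer cannot distinguish `AI` from `ai`; a cased checkpoint would be the usual choice when capitalization carries meaning.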

**Step 4:** Visualizing Tokenization Output

In [None]:
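tokenization_results = []

for text in text_samples:
    # Split the text into WordPiece tokens. tokenize() does not add the
    # special [CLS] and [SEP] tokens.
    tokens = tokenizer.tokenize(text)
    # Map each token to its integer ID in the tokenizer's vocabulary.
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    tokenization_results.append({
        "Original Text": text,
        "Tokens": tokens,
        "Token IDs": token_ids
    })

# One row per sample sentence.
df = pd.DataFrame(tokenization_results)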


In [None]:
pd.set_option('display.max_colwidth', None)
display(df[['Original Text', 'Tokens', 'Token IDs']])


Index,Original Text,Tokens,Token IDs
0,Artificial intelligence transforms computational paradigms with neural networks.,"[artificial, intelligence, transforms, computational, paradigm, ##s, with, neural, networks, .]","[7976, 4454, 21743, 15078, 20680, 2015, 2007, 15756, 6125, 1012]"
1,The principles of thermodynamics underpin many engineering solutions.,"[the, principles, of, the, ##rm, ##od, ##yna, ##mics, under, ##pin, many, engineering, solutions, .]","[1996, 6481, 1997, 1996, 10867, 7716, 18279, 22924, 2104, 8091, 2116, 3330, 7300, 1012]"
2,Blockchain technology promises to revolutionize data security in software engineering.,"[block, ##chai, ##n, technology, promises, to, revolution, ##ize, data, security, in, software, engineering, .]","[3796, 24925, 2078, 2974, 10659, 2000, 4329, 4697, 2951, 3036, 1999, 4007, 3330, 1012]"
3,Reinforcement learning enables autonomous systems to learn from interactions.,"[reinforcement, learning, enables, autonomous, systems, to, learn, from, interactions, .]","[23895, 4083, 12939, 8392, 3001, 2000, 4553, 2013, 10266, 1012]"
4,Quantum computing could potentially solve complex problems faster than classical computers.,"[quantum, computing, could, potentially, solve, complex, problems, faster, than, classical, computers, .]","[8559, 9798, 2071, 9280, 9611, 3375, 3471, 5514, 2084, 4556, 7588, 1012]"
5,Robotic automation in manufacturing accelerates production efficiency and precision.,"[robotic, automation, in, manufacturing, accelerate, ##s, production, efficiency, and, precision, .]","[20478, 19309, 1999, 5814, 23306, 2015, 2537, 8122, 1998, 11718, 1012]"
6,The integration of IoT devices enhances smart city infrastructure management.,"[the, integration, of, io, ##t, devices, enhance, ##s, smart, city, infrastructure, management, .]","[1996, 8346, 1997, 22834, 2102, 5733, 11598, 2015, 6047, 2103, 6502, 2968, 1012]"
7,Deep learning algorithms require substantial data for model training and feature extraction.,"[deep, learning, algorithms, require, substantial, data, for, model, training, and, feature, extraction, .]","[2784, 4083, 13792, 5478, 6937, 2951, 2005, 2944, 2731, 1998, 3444, 14676, 1012]"
8,Finite element analysis is pivotal in structural engineering for stress testing.,"[finite, element, analysis, is, pivotal, in, structural, engineering, for, stress, testing, .]","[10713, 5783, 4106, 2003, 20369, 1999, 8332, 3330, 2005, 6911, 5604, 1012]"
9,Natural language processing facilitates human-computer interaction in AI applications.,"[natural, language, processing, facilitates, human, -, computer, interaction, in, ai, applications, .]","[3019, 2653, 6364, 27777, 2529, 1011, 3274, 8290, 1999, 9932, 5097, 1012]"
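
Several words in the table are split into pieces marked with `##`, for example `paradigm, ##s` and `the, ##rm, ##od, ##yna, ##mics`. The sketch below illustrates the greedy longest-match-first lookup commonly used for WordPiece splitting. It is a simplified illustration, not the `transformers` implementation: the function `wordpiece_split` and the tiny `TOY_VOCAB` are hypothetical, and the toy vocabulary contains only the pieces visible in the table above rather than the real `vocab.txt`.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (illustrative)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption): only pieces that appear in the table above.
TOY_VOCAB = {
    "paradigm", "##s", "under", "##pin", "block", "##chai", "##n",
    "the", "##rm", "##od", "##yna", "##mics", "io", "##t",
}

for word in ["paradigms", "underpin", "blockchain", "thermodynamics", "iot"]:
    print(word, "->", wordpiece_split(word, TOY_VOCAB))
```

With the full vocabulary the same procedure keeps common words such as `reinforcement` or `quantum` whole, while rarer words fall back to shorter pieces, which matches the pattern in the table.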
