#Word Tokenization
Word tokenization splits text into smaller meaningful units (tokens), here whole words, before feeding it into a model. Its main limitations:
1. Large vocabulary (each unique word = new token)
2. Can’t handle unseen words (OOV = “out of vocabulary”)
3. Misses sub-word meaning (e.g. “Transformers”, “transforming”, “transformation” are all separate tokens)

**Used by NLTK, spaCy, TextBlob, and most classical NLP pipelines**

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The company spent $30,000,000 last year.")

for token in doc:
  print(token)

The
company
spent
$
30,000,000
last
year
.


#Rule Based Tokenization
Rule-based tokenization is a common method that applies predefined rules based on whitespace, punctuation, or patterns. Examples include whitespace tokenization, regular-expression-based tokenization, punctuation-based tokenization, and hybrid tokenization.
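
As a minimal sketch of two of these rules, the snippet below compares plain whitespace splitting with a simple regular-expression rule using only the standard library. The sample sentence and the pattern `\w+|[^\w\s]` are illustrative assumptions, not a rule taken from any particular library.

```python
import re

text = "The company spent $30,000,000 last year."

# Whitespace tokenization: split on runs of whitespace only,
# so punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Regex tokenization: keep runs of word characters together and
# emit every punctuation character as its own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting keeps `$30,000,000` and `year.` as single tokens, while the regex rule breaks the number apart at every comma. Neither matches what a tuned tokenizer would do with currency amounts, which is why hybrid rule sets exist.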

#Subword Tokenization
Breaks words into smaller meaningful units (subwords) instead of splitting only on spaces. Its main principle: frequent words should be kept as a single token, while rare words should be broken down into meaningful subwords.


Example (BERT's WordPiece tokenizer; `##` marks a piece that continues the previous one):

"I’m learning NLP with Transformers!"
→ ["i", "’", "m", "learning", "nl", "##p", "with", "transform", "##ers", "!"]

✅ Pros:

Much smaller vocabulary (~30K tokens)

Can handle new or rare words

Keeps semantic parts of words (prefix/suffix/stem)

❌ Cons:

Tokens are less interpretable

Requires a special decoder to reconstruct the text

📦 Used by:

BERT, DistilBERT (WordPiece)

GPT, RoBERTa, LLaMA (Byte Pair Encoding)

T5 (Unigram, via SentencePiece)

Subword tokenization sits between traditional word-level and character-level tokenization.

#Subword Tokenization
├── **Byte Pair Encoding (BPE)**
│   ├── Used by: GPT-2, GPT-3, RoBERTa, XLM
│   └── Variant: Byte-level BPE (used by GPT-2)
│
├── **WordPiece**
│   ├── Used by: BERT, DistilBERT, Electra
│   └── Algorithm type: Maximum likelihood merging
│
├── **Unigram Language Model**
│   ├── Used by: SentencePiece, ALBERT, T5
│   └── Algorithm type: Probability-based pruning
│
└── **SentencePiece**
    ├── Not an algorithm itself, but a framework
    └── Can implement BPE, Unigram, etc.


#Character Level Tokenization
👉 Character-level tokenization breaks text down into individual characters rather than words or subwords. It’s simple and robust to unseen words, but it produces much longer sequences, so it needs more computation and deeper models to capture meaning.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(char_level=True)

text = ["Deep Learning is Fun", "NLP is Fun"]

tokenizer.fit_on_texts(text)

print(tokenizer.word_index) #Each unique character in your text is assigned a unique integer ID.

print(tokenizer.texts_to_sequences(text))


{' ': 1, 'n': 2, 'e': 3, 'i': 4, 'p': 5, 'l': 6, 's': 7, 'f': 8, 'u': 9, 'd': 10, 'a': 11, 'r': 12, 'g': 13}
[[10, 3, 3, 5, 1, 6, 3, 11, 12, 2, 4, 2, 13, 1, 4, 7, 1, 8, 9, 2], [2, 6, 5, 1, 4, 7, 1, 8, 9, 2]]
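
The mapping printed above can be reproduced without Keras. The following is a minimal sketch, assuming the same behaviour as seen in the output: text is lowercased, IDs start at 1, and more frequent characters get smaller IDs. The names `char_counts`, `char_to_id`, and `sequences` are illustrative.

```python
from collections import Counter

texts = ["Deep Learning is Fun", "NLP is Fun"]

# Count every character after lowercasing (spaces included).
char_counts = Counter(ch for t in texts for ch in t.lower())

# Assign IDs starting at 1, most frequent character first.
# Characters with equal counts keep the order in which they were first seen.
char_to_id = {
    ch: idx for idx, (ch, _) in enumerate(char_counts.most_common(), start=1)
}

# Encode each text as a list of character IDs.
sequences = [[char_to_id[ch] for ch in t.lower()] for t in texts]

print(char_to_id)
print(sequences)
```

Note how even these two short strings become sequences of 20 and 10 IDs, which illustrates the sequence-length cost of character-level tokenization.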


# Byte-pair Encoding
Byte Pair Encoding (BPE) is a data compression algorithm that was adapted for NLP tokenization. Its core idea is:

Iteratively merge the most frequent pair of adjacent characters or tokens into a new, single token.

#Corpus
["low", "lowest", "newer", "wider"]
#Tokens
l o w
l o w e s t
n e w e r
w i d e r
#Vocabulary
['l', 'o', 'w', 'e', 's', 't', 'n', 'r', 'i', 'd']

#Count most frequent pairs

Find the most frequent adjacent pair of symbols.
Suppose 'l' + 'o' is picked (it occurs twice, in "low" and "lowest") → 'lo'
Merge them.
#New Tokens
lo w

lo w e s t

n e w e r

w i d e r

Find the next most frequent pair and merge it. Continue like this until the vocabulary reaches the desired size.

#Final (illustrative, after enough merges)
lowest → low + est, newer → new + er, wider → wid + er
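
The merge loop described above can be sketched in plain Python. This is a simplified illustration, not the implementation used by any library: each word is counted once, there is no end-of-word marker, and ties between equally frequent pairs go to the pair seen first. The function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, list(corpus)


merges, segmented = learn_bpe_merges(["low", "lowest", "newer", "wider"], num_merges=3)
print(merges)
print(segmented)
```

With this tie-breaking rule the first merge is `('l', 'o')`, matching the walk-through. A different tie-breaking rule would pick another pair of equal frequency first, so the exact merge order on a corpus this small should not be read as canonical.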


In [None]:
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())

# Before learning merges, split text into words on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# | PreTokenizer   | What it does                                   |
# | -------------- | ---------------------------------------------- |
# | `Whitespace()` | Splits on whitespace and separates punctuation |
# | `ByteLevel()`  | Works at byte level (used by GPT-2)            |
# | `Metaspace()`  | Replaces spaces with a visible symbol (▁)      |
# | `Digits()`     | Splits digits from other characters            |

trainer = trainers.BpeTrainer(
    vocab_size=100,      # Max number of tokens in the final vocab (characters + merged subwords)
    min_frequency=2,     # Ignore pairs that appear fewer than this many times
    show_progress=True,  # Display a progress bar while training
)
texts = ["low", "lowest", "newer", "wider"]

tokenizer.train_from_iterator(texts, trainer=trainer)
print(tokenizer.get_vocab())

encode = tokenizer.encode("lowest")

print(encode.tokens)
print(encode.ids)

{'t': 8, 'er': 10, 'l': 3, 'e': 1, 'd': 0, 'n': 4, 'o': 5, 's': 7, 'i': 2, 'low': 12, 'r': 6, 'w': 9, 'lo': 11}
['low', 'e', 's', 't']
[12, 1, 7, 8]


#WordPiece
WordPiece is a subword tokenization algorithm that builds its vocabulary by iteratively merging the pairs that most increase the likelihood of the training data. BPE merges the most frequent pair, whereas WordPiece merges the pair that maximizes language-model likelihood. Used by BERT, DistilBERT, and Electra.
#Formula
`score(pair) = frequency(pair) / (frequency(first_token) × frequency(second_token))`

This means WordPiece prefers to merge pairs where the tokens appear together much more often than they appear separately.

#Steps:
Let's say your corpus is: play, player, playing, played

1. Start with characters: p, l, a, y, e, r, i, n, g, d

2. Count the frequency of each adjacent pair: (p, l), (l, a), (a, y), (y, e), (e, r), ...

3. Merge the highest-scoring pair; for illustration, suppose it is p + l = "pl"

4. Continue until the desired vocabulary size is reached, e.g.
["p", "pl", "play", "player", "playing", "played"]
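
To see what the score formula does on this corpus, here is a minimal sketch that counts symbols and adjacent pairs and ranks the pairs by score. It assumes each word occurs once and, for brevity, leaves out the `##` prefix that real WordPiece puts on non-initial pieces. The names `token_counts`, `pair_counts`, and `score` are illustrative.

```python
from collections import Counter

corpus = ["play", "player", "playing", "played"]

token_counts = Counter()
pair_counts = Counter()
for word in corpus:
    # Start from single characters (the "##" continuation prefix is omitted).
    symbols = list(word)
    token_counts.update(symbols)
    pair_counts.update(zip(symbols, symbols[1:]))


def score(pair):
    """frequency(pair) / (frequency(first_token) * frequency(second_token))"""
    first, second = pair
    return pair_counts[pair] / (token_counts[first] * token_counts[second])


# Rank all pairs from highest to lowest score.
for pair in sorted(pair_counts, key=score, reverse=True):
    print(pair, pair_counts[pair], round(score(pair), 3))
```

On a corpus this small, the raw formula ranks rare pairs such as `('i', 'n')` above `('p', 'l')`, because `i` and `n` never appear apart while `p` and `l` are individually frequent. The `p + l` merge in the steps above is therefore best read as an illustration of the mechanics rather than the pair this formula would pick first.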


In [None]:
tokenizer = Tokenizer(models.WordPiece())

tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=100,
    min_frequency=2,
    show_progress=True,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

text = [
    "playing", "player", "players", "played", "playful",
    "unhappy", "unhealthy", "uncertain", "unlikely",
    "happily", "happiness", "happy", "unhappily"
]

tokenizer.train_from_iterator(text, trainer=trainer)

print(tokenizer.get_vocab())

output = tokenizer.encode("unbelievable players are playing")
print(output.tokens)

{'##p': 25, '##s': 37, '##in': 47, '##h': 23, 'e': 8, 'c': 6, '[SEP]': 3, 'play': 43, 'l': 14, '##ly': 46, 'g': 10, 'd': 7, '##ap': 41, '##i': 28, 'n': 15, '[UNK]': 1, '##g': 29, '##f': 35, 'i': 12, '[CLS]': 2, 't': 19, 'f': 9, '##k': 38, 'p': 16, 'unh': 49, 's': 18, '##u': 36, '##ay': 42, 'happ': 45, '##n': 22, 'a': 5, 'u': 20, '##er': 48, 'y': 21, '##a': 24, '##r': 34, '##app': 44, 'r': 17, 'player': 51, '##t': 32, '##y': 26, '##d': 31, '##l': 27, 'unhapp': 52, 'h': 11, '##e': 30, '[MASK]': 4, 'un': 40, 'k': 13, '##ily': 50, '[PAD]': 0, 'pl': 39, '##c': 33}
['[UNK]', 'player', '##s', 'a', '##r', '##e', 'play', '##in', '##g']
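
The output above follows from how WordPiece encodes a word once the vocabulary is fixed: it greedily takes the longest matching piece from the left, and if any remainder cannot be matched, the whole word becomes `[UNK]`. Below is a minimal sketch of that lookup. The function name `wordpiece_encode` is hypothetical, and `vocab` is a hand-picked subset of the trained vocabulary printed above, chosen only for illustration.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Nothing fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Illustrative subset of the vocabulary learned above.
vocab = {"un", "play", "player", "a", "##s", "##r", "##e", "##in", "##g"}

for word in "unbelievable players are playing".split():
    print(word, wordpiece_encode(word, vocab))
```

Here "unbelievable" matches `un` but then finds no piece starting with `##b`, so the entire word collapses to `[UNK]`. A vocabulary trained on a larger corpus would contain enough pieces to avoid this.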


In [1]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("unbelievable players are playing"))

['unbelievable', 'players', 'are', 'playing']


In [7]:
import pandas as pd
import numpy as np

df = pd.read_csv('/content/spam.csv')

df.head(),df.shape

(  Category                                            Message
 0      ham  Go until jurong point, crazy.. Available only ...
 1      ham                      Ok lar... Joking wif u oni...
 2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
 3      ham  U dun say so early hor... U c already then say...
 4      ham  Nah I don't think he goes to usf, he lives aro...,
 (5572, 2))

In [11]:
# Take the first 100 messages as a plain list of strings
text = df['Message'][:100].tolist()



In [17]:
tokens = [tokenizer.tokenize(t) for t in text]
print(tokens)


[['go', 'until', 'ju', '##rong', 'point', ',', 'crazy', '.', '.', 'available', 'only', 'in', 'bug', '##is', 'n', 'great', 'world', 'la', 'e', 'buffet', '.', '.', '.', 'ci', '##ne', 'there', 'got', 'amore', 'wat', '.', '.', '.'], ['ok', 'la', '##r', '.', '.', '.', 'joking', 'wi', '##f', 'u', 'on', '##i', '.', '.', '.'], ['free', 'entry', 'in', '2', 'a', 'w', '##k', '##ly', 'com', '##p', 'to', 'win', 'fa', 'cup', 'final', 't', '##kt', '##s', '21st', 'may', '2005', '.', 'text', 'fa', 'to', '87', '##12', '##1', 'to', 'receive', 'entry', 'question', '(', 'st', '##d', 'tx', '##t', 'rate', ')', 't', '&', 'c', "'", 's', 'apply', '08', '##45', '##28', '##100', '##75', '##over', '##18', "'", 's'], ['u', 'dun', 'say', 'so', 'early', 'ho', '##r', '.', '.', '.', 'u', 'c', 'already', 'then', 'say', '.', '.', '.'], ['nah', 'i', 'don', "'", 't', 'think', 'he', 'goes', 'to', 'us', '##f', ',', 'he', 'lives', 'around', 'here', 'though'], ['free', '##ms', '##g', 'hey', 'there', 'darling', 'it', "'", 's', 