# <font color = 'Yellow'> Tokenization in Natural Language Processing (NLP) </font>

Tokenization in Natural Language Processing (NLP) is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or even individual characters. It's a foundational step in NLP tasks, as it helps convert raw text into a format that can be processed by algorithms.

There are different types of tokenization:
1. **`Word Tokenization`** : Splitting text into individual words.
2. **`Subword Tokenization`** : Breaking words into smaller meaningful units, often used in models like BERT.
3. **`Character Tokenization`** : Treating each character as a token, useful for languages without clear word boundaries and for handling rare or misspelled words.
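
To make the three granularities concrete, here is a minimal pure-Python sketch. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and chosen only for illustration. Real subword vocabularies are learned from large corpora, and the greedy longest-match splitting shown here is only an approximation of how WordPiece-style tokenizers behave.

```python
# Hypothetical toy vocabulary; real subword vocabularies are learned from data.
TOY_VOCAB = {"token", "##ization", "##s"}


def word_tokens(text):
    """Word level: split on whitespace only."""
    return text.split()


def subword_tokens(word, vocab=TOY_VOCAB):
    """Subword level: greedy longest-match-first split of a single word.

    Pieces after the first carry a '##' prefix to mark them as continuations.
    Returns ['[UNK]'] if the word cannot be covered by the vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # no vocabulary entry matched at this position
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def char_tokens(text):
    """Character level: every character is its own token."""
    return list(text)


sample = "tokenization tokens"
print(word_tokens(sample))
print([p for w in word_tokens(sample) for p in subword_tokens(w)])
print(char_tokens("token"))
```

The same input yields 2 word tokens, 4 subword tokens, and (for the single word `token`) 5 character tokens, which shows the trade-off between vocabulary size and sequence length.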



## Import Required Libraries

In [27]:
# Install the required packages before tokenizing; skip this cell if they are already installed.
# ! pip uninstall -y nltk
# ! pip install nltk
# The two commented lines above uninstall and then reinstall nltk.

! pip install transformers


Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl.metadata (3.9 kB)
Downloading transformers-4.51.3-py3-none-any.whl (10.4 MB)
Downloading huggingface_hub-0.30.2-py3-none-any.whl (481 kB)
Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl (308 kB)
Downloading tokenizers-0.21.1-cp39-a

## <font color = 'Yellow'>Tokenization</font>

In [18]:
myword = '''Tokenization in Natural Language Processing (NLP) is the process of breaking down text into smaller units called tokens.
These tokens can be words, subwords, or even individual characters.
It's a foundational step in NLP tasks, as it helps convert raw text into a format that can be processed by algorithms.
'''

print(myword)


Tokenization in Natural Language Processing (NLP) is the process of breaking down text into smaller units called tokens.
These tokens can be words, subwords, or even individual characters.
It's a foundational step in NLP tasks, as it helps convert raw text into a format that can be processed by algorithms.



### Paragraph -> Sentences

In [24]:
from nltk.tokenize import sent_tokenize

para = sent_tokenize(myword)
print("Sentence Tokens:", para)

Sentence Tokens: ['Tokenization in Natural Language Processing (NLP) is the process of breaking down text into smaller units called tokens.', 'These tokens can be words, subwords, or even individual characters.', "It's a foundational step in NLP tasks, as it helps convert raw text into a format that can be processed by algorithms."]


In [21]:
type(para)

list

In [22]:
for sentence in para:
    print(sentence)

Tokenization in Natural Language Processing (NLP) is the process of breaking down text into smaller units called tokens.
These tokens can be words, subwords, or even individual characters.
It's a foundational step in NLP tasks, as it helps convert raw text into a format that can be processed by algorithms.
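
For intuition about what sentence tokenization involves, the following is a rough regex-based sketch. It is not how `sent_tokenize` works internally; NLTK relies on a trained Punkt model, which is designed to cope with cases such as abbreviations. The function name `naive_sent_split` and the sample text are assumptions made for this example.

```python
import re


def naive_sent_split(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "Tokens can be words. They can be characters! Dr. Smith agrees."
for sentence in naive_sent_split(sample):
    print(sentence)
```

The naive rule breaks `Dr. Smith agrees.` into two pieces because it treats every period as a sentence boundary, which is the kind of error a trained sentence tokenizer aims to avoid.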


## <font color = 'yellow'> Word Tokenization (Using NLTK) </font>

In [25]:
from nltk.tokenize import word_tokenize

# Word Tokenization
tokens = word_tokenize(myword)
print("Word Tokens:", tokens)


Word Tokens: ['Tokenization', 'in', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'the', 'process', 'of', 'breaking', 'down', 'text', 'into', 'smaller', 'units', 'called', 'tokens', '.', 'These', 'tokens', 'can', 'be', 'words', ',', 'subwords', ',', 'or', 'even', 'individual', 'characters', '.', 'It', "'s", 'a', 'foundational', 'step', 'in', 'NLP', 'tasks', ',', 'as', 'it', 'helps', 'convert', 'raw', 'text', 'into', 'a', 'format', 'that', 'can', 'be', 'processed', 'by', 'algorithms', '.']
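
The output above separates punctuation from words, which plain `str.split()` does not do. A small sketch with the standard `re` module shows the difference. The regex here is a simplification and an assumption of this example, not NLTK's actual rule set; for instance, it splits `It's` into three tokens, whereas `word_tokenize` above produced `'It'` and `"'s"`.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace; punctuation stays attached to words."""
    return text.split()


def regex_tokenize(text):
    """Runs of word characters become tokens; each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "It's a foundational step in NLP tasks, as it helps."
print(whitespace_tokenize(sample))
print(regex_tokenize(sample))
```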


## <font color = 'yellow'> Subword Tokenization (Using Hugging Face Transformers) </font>

In [28]:
from transformers import AutoTokenizer

# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize into subwords
subword_tokens = tokenizer.tokenize(myword)
print("Subword Tokens:", subword_tokens)


Subword Tokens: ['token', '##ization', 'in', 'natural', 'language', 'processing', '(', 'nl', '##p', ')', 'is', 'the', 'process', 'of', 'breaking', 'down', 'text', 'into', 'smaller', 'units', 'called', 'token', '##s', '.', 'these', 'token', '##s', 'can', 'be', 'words', ',', 'sub', '##words', ',', 'or', 'even', 'individual', 'characters', '.', 'it', "'", 's', 'a', 'foundation', '##al', 'step', 'in', 'nl', '##p', 'tasks', ',', 'as', 'it', 'helps', 'convert', 'raw', 'text', 'into', 'a', 'format', 'that', 'can', 'be', 'processed', 'by', 'algorithms', '.']
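
In the output above, a `##` prefix marks a piece that continues the previous token (for example `token` + `##ization`). The sketch below rejoins such pieces into whole words. `merge_wordpieces` is a hypothetical helper written for this example, not part of the `transformers` API, and it cannot restore the original casing because `bert-base-uncased` lowercases its input.

```python
def merge_wordpieces(pieces):
    """Rejoin '##'-prefixed continuation pieces with the piece before them."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the '##' marker and append
        else:
            words.append(piece)
    return words


pieces = ['token', '##ization', 'in', 'nl', '##p', 'token', '##s', '.']
print(merge_wordpieces(pieces))
# ['tokenization', 'in', 'nlp', 'tokens', '.']
```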


## <font color = 'yellow'> Character Tokenization </font>

In [29]:
# Tokenize into characters
character_tokens = list(myword)
print("Character Tokens:", character_tokens)


Character Tokens: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 'n', ' ', 'N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'L', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'P', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', '(', 'N', 'L', 'P', ')', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', ' ', 'o', 'f', ' ', 'b', 'r', 'e', 'a', 'k', 'i', 'n', 'g', ' ', 'd', 'o', 'w', 'n', ' ', 't', 'e', 'x', 't', ' ', 'i', 'n', 't', 'o', ' ', 's', 'm', 'a', 'l', 'l', 'e', 'r', ' ', 'u', 'n', 'i', 't', 's', ' ', 'c', 'a', 'l', 'l', 'e', 'd', ' ', 't', 'o', 'k', 'e', 'n', 's', '.', '\n', 'T', 'h', 'e', 's', 'e', ' ', 't', 'o', 'k', 'e', 'n', 's', ' ', 'c', 'a', 'n', ' ', 'b', 'e', ' ', 'w', 'o', 'r', 'd', 's', ',', ' ', 's', 'u', 'b', 'w', 'o', 'r', 'd', 's', ',', ' ', 'o', 'r', ' ', 'e', 'v', 'e', 'n', ' ', 'i', 'n', 'd', 'i', 'v', 'i', 'd', 'u', 'a', 'l', ' ', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', '.', '\n', 'I', 't', "'", 's', ' ', 'a', ' ',
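
Character tokens are usually mapped to integer IDs before being fed to a model. The sketch below builds a character vocabulary and round-trips a string through it. The function names and the sorted-order ID assignment are assumptions made for illustration; real pipelines may assign IDs differently and reserve special tokens.

```python
def build_char_vocab(text):
    """Assign an integer ID to each distinct character, in sorted order."""
    return {ch: idx for idx, ch in enumerate(sorted(set(text)))}


def encode(text, vocab):
    """Convert a string into a list of character IDs."""
    return [vocab[ch] for ch in text]


def decode(ids, vocab):
    """Convert a list of character IDs back into a string."""
    inverse = {idx: ch for ch, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


sample = "tokens"
vocab = build_char_vocab(sample)
ids = encode(sample, vocab)
print(vocab)
print(ids)
print(decode(ids, vocab))
```

The vocabulary stays small (one entry per distinct character), but the encoded sequence is as long as the text itself.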