<a href="https://colab.research.google.com/github/Ehtisham1053/Natural-Language-Processing/blob/main/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1️⃣ What is Tokenization?
Tokenization is the process of splitting text into smaller units, called tokens. These tokens can be words, subwords, sentences, or characters.

## 📌 Example
* Input Text:
👉 "Natural Language Processing is amazing!"

* After Tokenization (Word-based):
👉 ["Natural", "Language", "Processing", "is", "amazing", "!"]

## 2️⃣ Why Use Tokenization?
Tokenization is a fundamental step in NLP because it:

* Standardizes input data by splitting raw text into consistent, meaningful units.
* Prepares text for further processing (vectorization, embeddings, etc.).
* Exposes structure (sentences, words, etc.) that models can work with.
* Makes long texts manageable by breaking them into smaller units that can be processed individually.
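
As a minimal sketch of the "prepares text for further processing" point, the snippet below turns already-tokenized documents into count vectors (a bag-of-words representation). The two toy documents and the helper name `bag_of_words` are illustrative assumptions, not part of any library.

```python
from collections import Counter

# Two already-tokenized toy documents (illustrative data)
docs = [
    ["nlp", "is", "great"],
    ["nlp", "is", "fun", "fun"],
]

# Vocabulary: every distinct token, in a fixed (sorted) order
vocab = sorted({token for doc in docs for token in doc})

def bag_of_words(tokens, vocab):
    """Return one count per vocabulary entry (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)    # ['fun', 'great', 'is', 'nlp']
print(vectors)  # [[0, 1, 1, 1], [2, 0, 1, 1]]
```

Each position in a vector corresponds to one vocabulary entry, which is why the tokens have to be produced consistently before any vectorization step.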

## 3️⃣ When to Use Tokenization?
Tokenization is used in various NLP tasks, such as:
* ✅ Preprocessing for Machine Learning Models (Text Classification, Sentiment Analysis, etc.)
* ✅ Information Retrieval & Search Engines (Splitting text into searchable units)
* ✅ Chatbots & Conversational AI (Understanding input sentences)
* ✅ Text Summarization & Translation (Breaking text into manageable parts)
* ✅ Speech-to-Text Processing (Segmenting transcribed speech into meaningful words)

# 4️⃣ Types of Tokenization

## 🔹 1. Word Tokenization

In [4]:
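import nltk
# Download the Punkt tokenizer data (only needed once per environment)
nltk.download('punkt')
nltk.download('punkt_tab')  # looked up by newer NLTK releases
from nltk.tokenize import word_tokenize

text = "I love NLP! It's amazing."
tokens = word_tokenize(text)  # split into words and punctuation marks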

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [5]:
print("original text", text)
print("tokens", tokens)

original text I love NLP! It's amazing.
tokens ['I', 'love', 'NLP', '!', 'It', "'s", 'amazing', '.']


🔹 Challenges:

* Contractions are split into pieces (It's → ['It', "'s"]), following the Penn Treebank convention, so downstream code must expect tokens like "'s".
* Punctuation is treated as separate tokens.
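
If keeping contractions together matters for a task, one option is a small regular-expression tokenizer. The sketch below uses only the standard library; the pattern is an illustrative assumption (it only handles the straight apostrophe `'`) and is not how `word_tokenize` works internally.

```python
import re

text = "I love NLP! It's amazing."

# Either a word with an optional apostrophe suffix ('s, 't, 're, ...),
# or a single character that is neither a word character nor whitespace.
pattern = r"\w+(?:'\w+)?|[^\w\s]"

tokens = re.findall(pattern, text)
print(tokens)  # ['I', 'love', 'NLP', '!', "It's", 'amazing', '.']
```

Punctuation still comes out as separate tokens here; dropping it is a matter of filtering the list afterwards.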

## 🔹 2. Sentence Tokenization

In [6]:
from nltk.tokenize import sent_tokenize

text = "NLP is great. It helps machines understand language!"
sentences = sent_tokenize(text)


In [7]:
print(sentences)


['NLP is great.', 'It helps machines understand language!']


🔹 Challenges:

* Abbreviations are hard to tell apart from sentence endings: the period after "Dr" in "Dr. Smith is here." does not end a sentence.
* Some languages don't use punctuation to separate sentences.
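
The abbreviation problem is easy to reproduce with a naive splitter. The sketch below is a standard-library illustration, not NLTK's algorithm; `ABBREVIATIONS`, `naive_split`, and `split_with_abbreviations` are hypothetical names, and the abbreviation list is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]

def split_with_abbreviations(text):
    """Like naive_split, but glue a piece ending in a known abbreviation to the next one."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1] in ABBREVIATIONS:
            continue  # the period belonged to an abbreviation, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:  # text ended on an abbreviation
        sentences.append(buffer)
    return sentences

text = "Dr. Smith is here. He arrived today."
print(naive_split(text))               # ['Dr.', 'Smith is here.', 'He arrived today.']
print(split_with_abbreviations(text))  # ['Dr. Smith is here.', 'He arrived today.']
```

A fixed list like this breaks as soon as an unlisted abbreviation appears, which is one reason to prefer a trained sentence tokenizer in practice.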

## 🔹 3. Character Tokenization

In [8]:
text = "Hello!"
tokens = list(text)
print(tokens)

['H', 'e', 'l', 'l', 'o', '!']


🔹 Use Case:

* Used by character-level models (for example, character-level RNNs and some Transformer variants) that process text one character at a time.
* Useful in OCR (Optical Character Recognition) tasks.
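
Character-level models consume integer ids rather than raw characters. As a minimal sketch (the variable names are illustrative, and real pipelines usually build the vocabulary from a whole corpus rather than one string), the characters can be mapped to ids and back like this:

```python
text = "Hello!"

# Build a character vocabulary: one id per distinct character, in sorted order
vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
inverse_vocab = {idx: ch for ch, idx in vocab.items()}

# Encode the text as a list of ids, then decode it back
ids = [vocab[ch] for ch in text]
decoded = "".join(inverse_vocab[i] for i in ids)

print(vocab)    # {'!': 0, 'H': 1, 'e': 2, 'l': 3, 'o': 4}
print(ids)      # [1, 2, 3, 3, 4, 0]
print(decoded)  # Hello!
```

The vocabulary stays very small, but sequences become as long as the text itself, which is the usual trade-off of character tokenization.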

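## 🔹 4. Subword Tokenization (Byte Pair Encoding - BPE)
Instead of splitting only at spaces, subword tokenization breaks words into smaller, frequently occurring pieces.
* ✅ Handles rare and unseen words better, because they can be built from known subwords.
* ✅ Used in transformer-based models (GPT uses BPE; BERT uses the closely related WordPiece).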

Idealized examples (a trained model may split differently):

1. "unhappiness" → ["un", "happiness"]
2. "playing" → ["play", "ing"]
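
To make the idea behind BPE concrete, here is a simplified training sketch in plain Python: start from characters and repeatedly merge the most frequent adjacent pair. The function names and the toy corpus are assumptions for illustration, and details that production implementations add (end-of-word markers, byte-level handling, tie-breaking rules) are left out.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters, mapped to its frequency
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

corpus = "low low low lower lowest"  # toy corpus
merges, words = learn_bpe(corpus, num_merges=2)
print(merges)
print(words)
```

After two merges on this toy corpus, the frequent word "low" has become a single symbol, while "lower" and "lowest" are represented as "low" plus leftover characters. That is the behavior the bullets above describe: common strings get their own token, and rarer words are built from pieces.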


In [10]:
!pip install bpemb


In [12]:
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=10000)

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model




downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz




In [13]:
model_file = bpemb_en.model_file

In [15]:
import sentencepiece as spm

# Load the BPE model file that BPEmb downloaded and tokenize with SentencePiece directly
sp = spm.SentencePieceProcessor()
sp.Load(str(model_file))
print(sp.EncodeAsPieces("unhappiness playing"))

['▁un', 'h', 'app', 'iness', '▁playing']
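
In SentencePiece output, the `▁` character (U+2581) marks a piece that begins a new word, which is enough to reconstruct the original text. The sketch below does this by hand with the pieces from the output above; the variable names are illustrative.

```python
WORD_START = "\u2581"  # the "▁" marker placed before a word-initial piece

# Pieces copied from the output above
pieces = [WORD_START + "un", "h", "app", "iness", WORD_START + "playing"]

# Group pieces back into whole words
words = []
for piece in pieces:
    if piece.startswith(WORD_START):
        words.append(piece[len(WORD_START):])  # start a new word
    elif words:
        words[-1] += piece                     # continue the current word
    else:
        words.append(piece)                    # text did not start with a marker
print(words)  # ['unhappiness', 'playing']

# Or rebuild the full string in one step
text = "".join(pieces).replace(WORD_START, " ").strip()
print(text)  # unhappiness playing
```

With a loaded model, the `SentencePieceProcessor` method `DecodePieces` performs the same reconstruction.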


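### 🔹 Use Case:

* Machine Translation
* Text Generation
* Named Entity Recognition

### 🔹 Challenges:

* Requires a trained subword vocabulary: the merges are learned from a corpus, or loaded from a pre-trained model as above.
* Not always intuitive: words can be split into pieces that are not meaningful on their own (for example, "unhappiness" → ['▁un', 'h', 'app', 'iness']).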