<a href="https://colab.research.google.com/github/KhotNoorin/Natural-Language-Processing/blob/main/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization

Tokenization is the process of breaking down a given text into smaller units called *tokens*.  
Tokens can be words, subwords, characters, or even sentences, depending on the type of tokenization used.  
This is a fundamental step in Natural Language Processing (NLP) as it transforms raw text into a structured form suitable for machine learning models.

---

## Why Tokenization is Important
1. **Preprocessing**: Models cannot work directly with raw text; they require numerical input, and tokens are the units that get mapped to numbers.
2. **Structure Extraction**: Breaking text into tokens helps identify meaningful units.
3. **Language Understanding**: Tokens are the building blocks for further NLP tasks like parsing, sentiment analysis, and translation.
4. **Efficiency**: Working with a finite vocabulary of tokens, rather than arbitrary strings, keeps the input manageable for algorithms.
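
To make the first point concrete, here is a minimal sketch of how tokens might be turned into numerical input. The names `vocab` and `encode`, the toy corpus, and the `<unk>` convention are illustrative assumptions, not part of any particular library.

```python
# Toy corpus; the vocabulary and the <unk> convention are illustrative assumptions.
corpus = ["I love NLP", "I love Python"]

# Build a vocabulary: each distinct token gets an integer ID, with 0 reserved for unknowns.
vocab = {"<unk>": 0}
for sentence in corpus:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

def encode(text):
    """Map whitespace-separated tokens to IDs, falling back to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]

print(vocab)
print(encode("I love Java"))  # "Java" was never seen, so it maps to the <unk> ID
```

A model then consumes the list of IDs rather than the raw string.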

---

## Types of Tokenization

### 1. **Word Tokenization**
- Splits text into individual words.
- Example:  
  Text: "Natural Language Processing is amazing."  
  Tokens: ["Natural", "Language", "Processing", "is", "amazing", "."]
- **Pros**: Simple, intuitive.  
- **Cons**: Struggles with languages that do not mark word boundaries with spaces and with long compound words.

### 2. **Sentence Tokenization**
- Splits text into sentences.
- Example:  
  Text: "I love NLP. It is very useful."  
  Tokens: ["I love NLP.", "It is very useful."]
- Useful for tasks like summarization and translation.

### 3. **Subword Tokenization**
- Breaks words into smaller units (subwords) based on frequency or patterns.
- Example (illustrative; the actual split depends on the merges that BPE, Byte Pair Encoding, learned from its training corpus):  
  "unhappiness" → ["un", "happi", "ness"]
- **Pros**: Handles unknown words and rare vocabulary better.
- Commonly used in Transformer-based models such as BERT and GPT.
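
The frequency-based idea behind BPE can be sketched in a few lines. This is a toy version under simplifying assumptions (character-level start symbols, no end-of-word marker, ties broken by first occurrence); the function name `learn_bpe` and the sample words are hypothetical, and production tokenizers differ in many details.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word with the best pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, segmented = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(list(segmented))
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings are still spelled out character by character.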

### 4. **Character Tokenization**
- Each character becomes a token.
- Example: "chat" → ["c", "h", "a", "t"]
- **Pros**: No out-of-vocabulary words; useful for languages written without spaces (e.g., Chinese, Japanese).
- **Cons**: Sequence length increases significantly.
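
The sequence-length trade-off is easy to see by counting tokens for the same sentence. This is a small illustration that uses a plain whitespace split as the word-level baseline.

```python
text = "Natural Language Processing is amazing."

word_tokens = text.split()   # whitespace-based word tokens
char_tokens = list(text)     # one token per character, spaces included

print(len(word_tokens), len(char_tokens))
```

The character sequence is several times longer than the word sequence, although it draws on a much smaller set of distinct symbols.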

---

## Popular Tokenization Techniques and Libraries
1. **NLTK**  
   - Functions like `word_tokenize` and `sent_tokenize`.
2. **spaCy**  
   - Fast and accurate tokenization for multiple languages.
3. **Hugging Face Tokenizers**  
   - Specialized for modern NLP models.
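4. **Byte Pair Encoding (BPE)**  
   - Subword algorithm used in GPT models and RoBERTa.
5. **WordPiece**  
   - Subword algorithm used in BERT.
6. **SentencePiece**  
   - Language-independent library that works on raw text without whitespace pre-tokenization; used in T5 and XLNet.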

---

## Challenges in Tokenization
1. **Ambiguity in text**:  
   - Example: should "New York-based" be split as ["New", "York-based"] or ["New", "York", "-", "based"]?
2. **Language-specific rules**:  
   - Different rules for spacing, punctuation, and compound words.
3. **Special characters and emojis**:  
   - Need for proper handling in social media data.
4. **Multilingual texts**:  
   - Tokenizer must adapt to multiple scripts and alphabets.
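
The "New York-based" ambiguity comes down to a policy choice, which the following sketch expresses as two regular expressions. Both patterns and the sample sentence are illustrative assumptions, not a recommended tokenizer.

```python
import re

text = "New York-based startup 🚀"

# Policy A: keep hyphenated compounds together as one token.
keep_hyphens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

# Policy B: split on every punctuation mark, including hyphens.
split_hyphens = re.findall(r"\w+|[^\w\s]", text)

print(keep_hyphens)
print(split_hyphens)
```

The single-code-point emoji survives as its own token here, but an emoji built from several code points would be split apart by these patterns, which is one reason social media text usually needs dedicated handling.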


# Using NLTK

In [1]:
import nltk

In [6]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [7]:
from nltk.tokenize import word_tokenize

In [8]:
text = "I love learning NLP with Python!"

In [9]:
tokens = word_tokenize(text)

In [10]:
print(tokens)

['I', 'love', 'learning', 'NLP', 'with', 'Python', '!']


# Using spaCy

In [11]:
import spacy

In [12]:
# If the model is missing, install it first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [13]:
doc = nlp("I love learning NLP with Python!")
tokens = [token.text for token in doc]

In [14]:
print(tokens)

['I', 'love', 'learning', 'NLP', 'with', 'Python', '!']


# Using Hugging Face Tokenizers

In [15]:
from transformers import BertTokenizer

In [16]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


In [17]:
tokens = tokenizer.tokenize("I love learning NLP with Python!")

In [18]:
print(tokens)

['i', 'love', 'learning', 'nl', '##p', 'with', 'python', '!']
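
The output above is lowercased because the model is uncased, and "nlp" is split into `nl` and `##p` because the whole word is not in the vocabulary. The sketch below shows the greedy longest-match-first strategy that WordPiece-style tokenizers are generally described as using. The function `wordpiece` and the tiny `toy_vocab` are hypothetical stand-ins, not BERT's real vocabulary or the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"nl", "##p", "python", "love"}  # hypothetical mini-vocabulary
print(wordpiece("nlp", toy_vocab))
print(wordpiece("python", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```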


# Without Any External Library

In [20]:
text = "I love learning NLP with Python!"

In [21]:
# Simple Word Tokenization

tokens = text.split()
print(tokens)

['I', 'love', 'learning', 'NLP', 'with', 'Python!']


In [22]:
# Using Regular Expressions (\w+ matches word characters only, so the "!" is dropped)

import re

tokens = re.findall(r"\b\w+\b", text)
print(tokens)

['I', 'love', 'learning', 'NLP', 'with', 'Python']


In [23]:
# Character Tokenization

text = "NLP"
tokens = list(text)
print(tokens)

['N', 'L', 'P']


In [24]:
# Sentence Tokenization

text = "I love NLP. It is fun! Do you like it?"
sentences = re.split(r'(?<=[.!?]) +', text)
print(sentences)

['I love NLP.', 'It is fun!', 'Do you like it?']
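
A punctuation-based split like this one works for simple text but has a known weak spot: abbreviations. The snippet below reuses the same pattern on a made-up sentence to show the effect.

```python
import re

text = "Dr. Smith teaches NLP. His course starts at 9 a.m. on Monday."

# Same pattern as above: split on spaces that follow ., ! or ?
sentences = re.split(r'(?<=[.!?]) +', text)
print(sentences)
```

The two real sentences come back as four fragments because "Dr." and "a.m." end in a period, which is why library sentence tokenizers rely on trained models or abbreviation lists.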


In [26]:
import re

def sentence_tokenize(text):
    """
    Splits text into sentences at whitespace that follows ., !, or ?.
    """
    # The lookbehind keeps the end punctuation attached to its sentence
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if s]

# Named regex_word_tokenize so it does not shadow NLTK's word_tokenize imported earlier
def regex_word_tokenize(text):
    """
    Splits text into tokens (words, numbers, punctuation).
    Keeps contractions like "don't" together and matches letters from any script.
    """
    # [^\W\d_] matches Unicode word characters other than digits and underscore;
    # [^\w\s] matches a single punctuation mark or symbol
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+|[^\w\s]", text)
    return tokens

# Example usage
text = "I love NLP! It's amazing, isn't it? 123. Let's go."
print("Sentences:", sentence_tokenize(text))
print("Tokens:", regex_word_tokenize(text))

Sentences: ['I love NLP!', "It's amazing, isn't it?", '123.', "Let's go."]
Tokens: ['I', 'love', 'NLP', '!', "It's", 'amazing', ',', "isn't", 'it', '?', '123', '.', "Let's", 'go', '.']
