<h1 style="background-color: #f8f0fa;
            border-left: 5px solid #1b4332;
            font-family: 'Trebuchet MS', sans-serif;
            border-right: 5px solid #1b4332;
            padding: 12px;
            border-radius: 50px 50px;
            color: #1b4332;
            text-align:center;
            font-size:45px;"><strong>😊Tokenization🌟</strong></h1>
<hr style="border-top: 5px solid #264653;">

## Introduction
Tokenization is a fundamental step in Natural Language Processing (NLP) in which a text string is split into smaller units called tokens. Depending on the application, these tokens can be words, sentences, subwords, or individual characters.

---

## Why is Tokenization Important?
1. Converts raw text into a format that can be processed by models.
2. Breaks text into units that carry meaning (words, subwords, sentences), so later steps can analyze them individually.
3. Acts as the first step in most NLP pipelines.

---

## Types of Tokenization
1. **Word Tokenization**
   - Splits text into individual words.
   - Example: "Natural Language Processing" → ["Natural", "Language", "Processing"]

2. **Sentence Tokenization**
   - Splits text into sentences.
   - Example: "Hello World. Welcome to NLP." → ["Hello World.", "Welcome to NLP."]

3. **Subword Tokenization**
   - Breaks words into smaller subunits.
   - Example: "unbelievable" → ["un", "believ", "able"]

4. **Character Tokenization**
   - Splits text into individual characters.
   - Example: "NLP" → ["N", "L", "P"]
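
The four granularities above can be approximated with the standard library alone. The following is a minimal sketch, not how NLTK or spaCy work internally: the sentence splitter is a naive regex that assumes every sentence ends with `.`, `!`, or `?` followed by whitespace, and the subword splitter matches greedily against `TOY_VOCAB`, a hand-written, hypothetical vocabulary rather than a learned one.

```python
import re

def word_tokens(text):
    # Whitespace split: the simplest possible word tokenizer.
    return text.split()

def sentence_tokens(text):
    # Naive rule: split after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def char_tokens(text):
    return list(text)

def subword_tokens(word, vocab):
    # Greedy longest-match-first segmentation against a fixed vocabulary.
    # Falls back to a single character when no vocabulary entry matches.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

TOY_VOCAB = {"un", "believ", "able"}  # hypothetical, for illustration only

print(word_tokens("Natural Language Processing"))
print(sentence_tokens("Hello World. Welcome to NLP."))
print(subword_tokens("unbelievable", TOY_VOCAB))
print(char_tokens("NLP"))
```

Real tokenizers handle cases these helpers miss, such as abbreviations ("Dr. Smith") in sentence splitting or punctuation attached to words.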

---

## Popular Libraries for Tokenization
1. **NLTK (Natural Language Toolkit)**
   - `word_tokenize`: For word-level tokenization.
   - `sent_tokenize`: For sentence-level tokenization.

2. **spaCy**
   - Offers efficient and customizable tokenization.
   - Attaches linguistic annotations, such as part-of-speech tags, to each token.

3. **Hugging Face Tokenizers**
   - Supports modern tokenization techniques for transformers like BERT, GPT, etc.
   - Provides subword tokenization and encoding.

4. **Regex-based Tokenization**
   - Custom tokenization using Python's `re` library.

---

## Implementation Examples

### 1. Word Tokenization with NLTK

In [5]:
import warnings 
warnings.filterwarnings('ignore')

In [6]:
import nltk

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\hassa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [7]:
print("NLTK Data Path:", nltk.data.path)

NLTK Data Path: ['C:\\Users\\hassa/nltk_data', 'c:\\python\\python3126\\nltk_data', 'c:\\python\\python3126\\share\\nltk_data', 'c:\\python\\python3126\\lib\\nltk_data', 'C:\\Users\\hassa\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data']


In [11]:
from nltk.tokenize import word_tokenize

text = "Hello everyone. my name is hassane. I'm 20 years old. I'm from Morocco."

tokens = word_tokenize(text)
print(tokens)

['Hello', 'everyone', '.', 'my', 'name', 'is', 'hassane', '.', 'I', "'m", '20', 'years', 'old', '.', 'I', "'m", 'from', 'Morocco', '.']


### 2. Sentence Tokenization with NLTK

In [8]:
from nltk.tokenize import sent_tokenize

text = "Hello everyone. my name is hassane. I'm 20 years old. I'm from Morocco."

sentences = sent_tokenize(text)
print(sentences)

['Hello everyone.', 'my name is hassane.', "I'm 20 years old.", "I'm from Morocco."]


### 3. Tokenization with spaCy

Download the English model first (run this in a terminal, not in a Python cell):
```bash
python -m spacy download en_core_web_sm
```

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello my name is hassane skikri.")
tokens = [token.text for token in doc]
print(tokens)

['Hello', 'my', 'name', 'is', 'hassane', 'skikri', '.']


### 4. Custom Tokenization with Regex

In [10]:
import re

text = "Tokenization in NLP!"
tokens = re.findall(r"\b\w+\b", text)
print(tokens)

['Tokenization', 'in', 'NLP']
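
Note that the pattern `\b\w+\b` discards punctuation and splits contractions such as "I'm" into two tokens. If those matter for your task, the pattern can be extended. The sketch below is one possible variant, assuming that a word may contain a single internal apostrophe and that every other non-space symbol should become its own token; `TOKEN_PATTERN` and `regex_tokenize` are illustrative names.

```python
import re

# Alternatives are tried left to right:
#   \w+(?:'\w+)?  a word, optionally with one internal apostrophe (e.g. "I'm")
#   [^\w\s]       any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

sample = "I'm learning NLP!"
print(re.findall(r"\b\w+\b", sample))  # ['I', 'm', 'learning', 'NLP']
print(regex_tokenize(sample))          # ["I'm", 'learning', 'NLP', '!']
```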


---

## Challenges in Tokenization
1. **Ambiguity in Languages**
   - Some languages (e.g., Chinese, Japanese) do not use spaces to separate words, so word boundaries have to be inferred.
2. **Handling Special Characters**
   - Symbols, hashtags, and URLs need special treatment.
3. **Language-Specific Rules**
   - Grammar and semantics vary across languages, affecting tokenization.
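
The first two challenges are easy to reproduce. The sketch below shows whitespace splitting failing on an unsegmented Chinese sentence, then a regex that keeps URLs, hashtags, and mentions intact. The pattern is an assumption chosen for illustration, not a complete solution: for instance, punctuation directly after a URL would be absorbed into the URL token.

```python
import re

# Whitespace splitting finds no boundaries in an unsegmented script,
# so the whole sentence comes back as a single "token".
print("我喜欢自然语言处理".split())

SOCIAL_PATTERN = re.compile(
    r"""
    https?://\S+        # URLs: scheme followed by any run of non-whitespace
    | [#@]\w+           # hashtags and @mentions
    | \w+(?:'\w+)?      # words, optionally with one internal apostrophe
    | [^\w\s]           # any other single non-space symbol
    """,
    re.VERBOSE,
)

def tokenize_social(text):
    return SOCIAL_PATTERN.findall(text)

print(tokenize_social("Read https://example.com/nlp now! #NLP @hassane"))
# ['Read', 'https://example.com/nlp', 'now', '!', '#NLP', '@hassane']
```

The order of the alternatives matters: the URL branch comes first so that `https` is not consumed as an ordinary word.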

---

## Advanced Tokenization Techniques
1. **Byte Pair Encoding (BPE)**
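   - Used in models like GPT-2 and RoBERTa; BERT uses the closely related WordPiece algorithm instead.
   - Starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols into a new subword.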

2. **Unigram Language Models**
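   - Probabilistic approach for subword segmentation: starts from a large candidate vocabulary and prunes it, keeping the subwords that best explain the training text.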

3. **SentencePiece**
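   - Library for unsupervised text tokenization and subword segmentation. It implements both BPE and the unigram model and works directly on raw text, treating whitespace as an ordinary symbol.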
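
To make the BPE idea concrete, here is a minimal sketch of how merges can be learned from word frequencies. It is a simplified teaching version, not the implementation used by any particular library: it omits the end-of-word marker and byte-level handling found in production tokenizers, and the word counts are made up for illustration.

```python
from collections import Counter

def get_pair_counts(corpus):
    # corpus maps a word (as a tuple of symbols) to its frequency.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

# Toy word frequencies, made up for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_freqs, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(corpus)
```

On this toy data the shared suffix "est" emerges after two merges, which is how BPE ends up with reusable subwords without any linguistic rules.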

---

## Tips for Effective Tokenization
1. Choose the right tokenization technique for your use case.
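2. Normalize text (e.g., lowercasing) before tokenization if needed, and remove stop words afterwards, since stop-word filtering operates on tokens.
3. When working with a pretrained model, use the tokenizer it was trained with (for example, through Hugging Face), since the model's vocabulary depends on it.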
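
A minimal sketch of the second tip, using only the standard library. Lowercasing is applied to the raw string, while stop-word removal is applied to the token list, because it needs tokens to compare against. `STOP_WORDS` is a tiny hand-picked set for illustration; libraries such as NLTK and spaCy ship fuller lists.

```python
import re

# Tiny hand-picked stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to"}

def preprocess(text):
    normalized = text.lower()                 # normalize before tokenizing
    tokens = re.findall(r"\w+", normalized)   # tokenize
    # Filter stop words after tokenizing.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Tokenization is the first step in NLP."))
# ['tokenization', 'first', 'step', 'nlp']
```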

---

## Real-world Applications
1. **Search Engines**
   - Tokenization helps in indexing and retrieving documents.
2. **Chatbots**
   - Tokenization enables better understanding of user queries.
3. **Machine Translation**
   - Subword tokenization improves translation quality.
4. **Text Summarization**
   - Sentence tokenization aids in extracting meaningful summaries.
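
As an illustration of the search-engine case, the sketch below builds a tiny inverted index: a mapping from each token to the documents that contain it. It assumes a simple lowercase regex tokenizer and an AND-style query (every query token must appear); the documents and function names are invented for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    # Return the ids of documents that contain every query token.
    token_sets = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()

docs = {
    1: "Tokenization is the first step in NLP.",
    2: "Search engines index tokens.",
    3: "NLP powers search and translation.",
}
index = build_index(docs)
print(search(index, "NLP search"))  # {3}
```

Because queries and documents pass through the same `tokenize` function, "NLP" in the query matches "nlp" in the index; a mismatch between the two tokenizers is a common source of missed results.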

---

## Conclusion
Tokenization is a critical step in NLP that transforms raw text into meaningful units. By mastering different tokenization techniques and tools, you can significantly enhance your NLP projects.

