# **Tokenization in NLP**

## **What is Tokenization?**
Tokenization is the process of splitting text into **smaller meaningful units**, called **tokens**. Tokens are typically **words**, **punctuation marks**, or **sentences**; some tokenizers also produce **subword** units. It is one of the fundamental steps in **Natural Language Processing (NLP)**.

---

## **Types of Tokenization**
1. **Word Tokenization**  
   - Splits text into individual words.  
   - Example:  
     **Text:** `"Hello! How are you?"`  
     **Word Tokens:** `["Hello", "!", "How", "are", "you", "?"]`

2. **Sentence Tokenization**  
   - Splits text into complete sentences.  
   - Example:  
     **Text:** `"Hello! How are you? Have a great day."`  
     **Sentence Tokens:** `["Hello!", "How are you?", "Have a great day."]`

---

## **Why is Tokenization Important?**
✔ **Prepares text for further processing** (stopword removal, stemming, lemmatization).  
✔ **Used in chatbots & AI assistants** to understand user queries.  
✔ **Improves search engines** by indexing words efficiently.  
✔ **Essential for text classification & sentiment analysis.**  

---

## **Challenges in Tokenization**
❌ **Handling contractions** (e.g., should `"I'm"` stay one token or be split into `"I"` and `"'m"`?).  
❌ **Recognizing multi-word expressions** (e.g., `"New York"` should be one token).  
❌ **Language-specific rules** (tokenization varies across languages).  
❌ **Dealing with punctuation** (e.g., the periods in `"U.S.A."` belong to the token and do not end a sentence, unlike in `"USA."`).  
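
To make these challenges concrete, here is a minimal sketch of two naive approaches built only on the standard library. The helper names `naive_words` and `naive_sentences` are hypothetical and exist only for this illustration; they are not part of NLTK or any other library.

```python
import re

def naive_words(text):
    """Split on whitespace only; punctuation stays glued to the words."""
    return text.split()

def naive_sentences(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_words("Hello! How are you?"))
# ['Hello!', 'How', 'are', 'you?']

print(naive_sentences("The U.S.A. is a tricky case. So is this."))
# ['The U.S.A.', 'is a tricky case.', 'So is this.']
```

The second call returns three pieces for a two-sentence input because the final period of `U.S.A.` is treated as a sentence boundary. Cases like this are why dedicated tokenizers are usually preferred over a one-line split.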

---

## **Libraries for Tokenization**
- **NLTK** (Natural Language Toolkit) - Basic NLP tasks.  
- **spaCy** - Faster and optimized for large-scale NLP.  
- **Regex-based Tokenization** - Custom tokenization using patterns.  
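
As a rough illustration of the regex-based option, the sketch below uses Python's built-in `re` module with a hand-written pattern. The pattern and the function name `regex_tokenize` are assumptions made for this example; a real project would tune the alternatives to its own data.

```python
import re

# Alternatives are tried left to right, so the most specific ones come first.
TOKEN_PATTERN = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # e-mail addresses
  | (?:[A-Za-z]\.){2,}             # dotted abbreviations such as U.S.A.
  | \d+(?:\.\d+)?                  # integers and decimals such as 3.9
  | \w+(?:[-']\w+)*                # words, allowing inner hyphens and apostrophes
  | [^\w\s]                        # any other single punctuation mark
    """,
    re.VERBOSE,
)

def regex_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Python 3.9 is great, isn't it? Mail test@example.com from the U.S.A. today."))
# ['Python', '3.9', 'is', 'great', ',', "isn't", 'it', '?', 'Mail',
#  'test@example.com', 'from', 'the', 'U.S.A.', 'today', '.']
```

Unlike NLTK's `word_tokenize`, this particular pattern keeps `isn't` and the e-mail address as single tokens. Neither choice is more correct in general; it depends on what the downstream task needs.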

Tokenization is typically the **first step** in an NLP pipeline, setting the foundation for text analysis, machine learning models, and AI applications.


In [1]:
import nltk
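# One-time setup: sent_tokenize and word_tokenize need the Punkt tokenizer data.
# nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab'; 'all' also works but is a large download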
from nltk.tokenize import sent_tokenize

In [2]:
corpus=""" Hello! I'm testing tokenization. Let's see how it works.
NLTK's tokenizer is quite powerful—it's useful for NLP tasks.
Python 3.9 is great, isn't it? What about e-mail addresses like test@example.com?
Hey... check this out: $50, 100%, and U.S.A. are tricky cases. """

In [3]:
print(corpus)

 Hello! I'm testing tokenization. Let's see how it works.
NLTK's tokenizer is quite powerful—it's useful for NLP tasks.
Python 3.9 is great, isn't it? What about e-mail addresses like test@example.com?
Hey... check this out: $50, 100%, and U.S.A. are tricky cases. 


## Sentence Tokenization

In [4]:
# Split the corpus into a list of sentence strings.
documents = sent_tokenize(corpus)
documents

[' Hello!',
 "I'm testing tokenization.",
 "Let's see how it works.",
 "NLTK's tokenizer is quite powerful—it's useful for NLP tasks.",
 "Python 3.9 is great, isn't it?",
 'What about e-mail addresses like test@example.com?',
 'Hey... check this out: $50, 100%, and U.S.A. are tricky cases.']

In [5]:
type(documents)

list

In [6]:
# Each element of documents is a single sentence.
for sentence in documents:
    print(sentence)

 Hello!
I'm testing tokenization.
Let's see how it works.
NLTK's tokenizer is quite powerful—it's useful for NLP tasks.
Python 3.9 is great, isn't it?
What about e-mail addresses like test@example.com?
Hey... check this out: $50, 100%, and U.S.A. are tricky cases.


## Word Tokenization

In [7]:
# word_tokenize splits text into words and punctuation marks, and breaks
# contractions into parts (for example "isn't" becomes "is" and "n't").
from nltk.tokenize import word_tokenize

In [8]:
word_tokenize(corpus)

['Hello',
 '!',
 'I',
 "'m",
 'testing',
 'tokenization',
 '.',
 'Let',
 "'s",
 'see',
 'how',
 'it',
 'works',
 '.',
 'NLTK',
 "'s",
 'tokenizer',
 'is',
 'quite',
 'powerful—it',
 "'s",
 'useful',
 'for',
 'NLP',
 'tasks',
 '.',
 'Python',
 '3.9',
 'is',
 'great',
 ',',
 'is',
 "n't",
 'it',
 '?',
 'What',
 'about',
 'e-mail',
 'addresses',
 'like',
 'test',
 '@',
 'example.com',
 '?',
 'Hey',
 '...',
 'check',
 'this',
 'out',
 ':',
 '$',
 '50',
 ',',
 '100',
 '%',
 ',',
 'and',
 'U.S.A.',
 'are',
 'tricky',
 'cases',
 '.']

In [9]:
# Tokenize each sentence separately to get one token list per sentence.
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', '!']
['I', "'m", 'testing', 'tokenization', '.']
['Let', "'s", 'see', 'how', 'it', 'works', '.']
['NLTK', "'s", 'tokenizer', 'is', 'quite', 'powerful—it', "'s", 'useful', 'for', 'NLP', 'tasks', '.']
['Python', '3.9', 'is', 'great', ',', 'is', "n't", 'it', '?']
['What', 'about', 'e-mail', 'addresses', 'like', 'test', '@', 'example.com', '?']
['Hey', '...', 'check', 'this', 'out', ':', '$', '50', ',', '100', '%', ',', 'and', 'U.S.A.', 'are', 'tricky', 'cases', '.']


## **WordPunct Tokenization in NLP**
`wordpunct_tokenize` is a **tokenization method** from **NLTK** that splits text into runs of **alphanumeric characters** and runs of **punctuation**. Unlike `word_tokenize`, which keeps tokens such as `3.9`, `e-mail`, `U.S.A.`, and the clitic `'m` intact, it breaks at **every boundary** between word characters and punctuation.

## **Comparison with `word_tokenize`**
| Feature                 | `word_tokenize`  | `wordpunct_tokenize`  |
|-------------------------|-----------------|-----------------------|
| **Handles punctuation** | Separates most punctuation, but keeps it inside tokens such as `3.9`, `e-mail`, and `U.S.A.` | Splits at every run of punctuation |
| **Example: "I'm fine."** | `["I", "'m", "fine", "."]` | `["I", "'", "m", "fine", "."]` |
| **Better for contractions?** | ✅ Splits them into meaningful parts (`"I"`, `"'m"`) | ❌ Breaks the apostrophe off as its own token |
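
The splitting behaviour of `wordpunct_tokenize` can be reproduced with a single regular expression, `\w+|[^\w\s]+`, which is the pattern NLTK's `WordPunctTokenizer` is built on. The sketch below uses only the standard `re` module; the function name `wordpunct_like` is hypothetical and chosen for this example.

```python
import re

def wordpunct_like(text):
    """Match runs of word characters OR runs of non-space punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_like("I'm fine."))
# ['I', "'", 'm', 'fine', '.']

print(wordpunct_like("100%, and U.S.A."))
# ['100', '%,', 'and', 'U', '.', 'S', '.', 'A', '.']
```

Because consecutive punctuation characters form one run, `%,` comes out as a single token, while every period in `U.S.A.` is separated from the letters around it.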




In [11]:
from nltk.tokenize import wordpunct_tokenize 

In [12]:
wordpunct_tokenize(corpus)

['Hello',
 '!',
 'I',
 "'",
 'm',
 'testing',
 'tokenization',
 '.',
 'Let',
 "'",
 's',
 'see',
 'how',
 'it',
 'works',
 '.',
 'NLTK',
 "'",
 's',
 'tokenizer',
 'is',
 'quite',
 'powerful',
 '—',
 'it',
 "'",
 's',
 'useful',
 'for',
 'NLP',
 'tasks',
 '.',
 'Python',
 '3',
 '.',
 '9',
 'is',
 'great',
 ',',
 'isn',
 "'",
 't',
 'it',
 '?',
 'What',
 'about',
 'e',
 '-',
 'mail',
 'addresses',
 'like',
 'test',
 '@',
 'example',
 '.',
 'com',
 '?',
 'Hey',
 '...',
 'check',
 'this',
 'out',
 ':',
 '$',
 '50',
 ',',
 '100',
 '%,',
 'and',
 'U',
 '.',
 'S',
 '.',
 'A',
 '.',
 'are',
 'tricky',
 'cases',
 '.']

## **TreebankWordDetokenizer in NLP**

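`TreebankWordDetokenizer` (from NLTK) **reconstructs text** from a list of tokens, restoring the spacing around punctuation and contractions. The result is a close approximation rather than an exact inverse: in the output at the end of this section, some periods keep a leading space and `test @ example.com` stays split.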

## **How It Works**
- **Fixes spacing and punctuation** (e.g., `["Hello", ",", "world", "!"]` → `"Hello, world!"`).
- **Handles contractions correctly** (e.g., `["I", "'m"]` → `"I'm"`).
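
To show the idea behind these two rules, here is a deliberately simplified sketch using only the standard `re` module. It is not NLTK's implementation, and the function name `simple_detokenize` and its two regular expressions are assumptions made for this example; it ignores cases such as quotes, brackets, and currency symbols.

```python
import re

def simple_detokenize(tokens):
    """Join tokens with spaces, then undo the two most common splits."""
    text = " ".join(tokens)
    # Re-attach clitics such as 'm, 's and n't to the preceding word.
    text = re.sub(r" (n't|'(?:m|s|re|ve|ll|d))\b", r"\1", text)
    # Remove the space before common closing punctuation.
    text = re.sub(r" ([.,!?;:%])", r"\1", text)
    return text

print(simple_detokenize(["Hello", ",", "world", "!"]))
# Hello, world!

print(simple_detokenize(["I", "'m", "fine", "."]))
# I'm fine.
```

A real detokenizer needs many more rules than these two, which is the main reason to rely on a library class instead of hand-written substitutions.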

In [14]:
from nltk.tokenize import TreebankWordDetokenizer 

In [15]:
# Despite the variable name, this object joins tokens back into text.
tokenizer = TreebankWordDetokenizer()

In [16]:
# Tokenize the whole corpus; this token list is the detokenizer's input.
token = word_tokenize(corpus)
token

['Hello',
 '!',
 'I',
 "'m",
 'testing',
 'tokenization',
 '.',
 'Let',
 "'s",
 'see',
 'how',
 'it',
 'works',
 '.',
 'NLTK',
 "'s",
 'tokenizer',
 'is',
 'quite',
 'powerful—it',
 "'s",
 'useful',
 'for',
 'NLP',
 'tasks',
 '.',
 'Python',
 '3.9',
 'is',
 'great',
 ',',
 'is',
 "n't",
 'it',
 '?',
 'What',
 'about',
 'e-mail',
 'addresses',
 'like',
 'test',
 '@',
 'example.com',
 '?',
 'Hey',
 '...',
 'check',
 'this',
 'out',
 ':',
 '$',
 '50',
 ',',
 '100',
 '%',
 ',',
 'and',
 'U.S.A.',
 'are',
 'tricky',
 'cases',
 '.']

In [17]:
detokenized_text = tokenizer.detokenize(token)
detokenized_text

"Hello! I'm testing tokenization . Let's see how it works . NLTK's tokenizer is quite powerful—it's useful for NLP tasks . Python 3.9 is great, isn't it? What about e-mail addresses like test @ example.com? Hey...check this out: $50, 100%, and U.S.A. are tricky cases."