## Tokenization Using NLTK

Tokenization is the process of splitting text into smaller units such as sentences or words. NLTK provides tokenizers built on the Punkt sentence-boundary model and on Penn Treebank word-splitting conventions.

### Prerequisite: Install and Download Required Resources

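# Install NLTK first if it is not already available: pip install nltk
import nltk

# Download required resources
nltk.download('punkt')         # Punkt models used by sent_tokenize and word_tokenize
# Note: NLTK 3.9 and later look for 'punkt_tab' instead; download that if 'punkt' is reported missing
nltk.download('averaged_perceptron_tagger')  # Optional, only needed for POS tagging after tokenization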

In [16]:
## Create a sample corpus

corpus = """Hello, I am Suraj Khodade. I am Tech Software developer! Tech Enthu.
Please do connect with me on LinkedIn. It's a great platform to network and learn.
I am a Data Science Enthusiast."""

corpus

"Hello, I am Suraj Khodade. I am Tech Software developer! Tech Enthu.\nPlease do connect with me on LinkedIn. It's a great platform to network and learn.\nI am a Data Science Enthusiast."

In [17]:
## 1. Sentence Tokenization
## Objective: Split a paragraph into individual sentences.

text = nltk.sent_tokenize(corpus)  # Uses the English Punkt model by default
print(text)

text = nltk.sent_tokenize(corpus, language='english')  # Same result, with the language stated explicitly
print(text)

text_german = nltk.sent_tokenize(corpus, language='german')  # German Punkt model, applied here to the same English text
print(text_german)

['Hello, I am Suraj Khodade.', 'I am Tech Software developer!', 'Tech Enthu.', 'Please do connect with me on LinkedIn.', "It's a great platform to network and learn.", 'I am a Data Science Enthusiast.']
['Hello, I am Suraj Khodade.', 'I am Tech Software developer!', 'Tech Enthu.', 'Please do connect with me on LinkedIn.', "It's a great platform to network and learn.", 'I am a Data Science Enthusiast.']
['Hello, I am Suraj Khodade.', 'I am Tech Software developer!', 'Tech Enthu.', 'Please do connect with me on LinkedIn.', "It's a great platform to network and learn.", 'I am a Data Science Enthusiast.']
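
All three calls agree on this corpus because every sentence ends with plain punctuation followed by whitespace. The difference between Punkt and a simple rule shows up with abbreviations. The sketch below is not NLTK code: `naive_sentence_split` is a hypothetical helper, written with the standard `re` module, that splits after every `.`, `!` or `?`. It illustrates the kind of mistake a trained sentence tokenizer is designed to avoid.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = "Dr. Smith went home. He slept."
naive_sentences = naive_sentence_split(sample)
print(naive_sentences)
# ['Dr.', 'Smith went home.', 'He slept.']  <- "Dr." is wrongly treated as a sentence
```

A Punkt-based tokenizer is expected to keep "Dr. Smith went home." together, since its model learns which tokens are likely abbreviations.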


In [19]:
## 2. Word Tokenization
## Objective: Tokenize text into individual words and punctuation marks.
from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(corpus)  # Uses English by default
print(word_tokens)

word_tokens = word_tokenize(corpus, language='english')  # Same result, with the language stated explicitly
print(word_tokens)

word_tokens_german = word_tokenize(corpus, language='german')  # German Punkt model for the sentence-splitting step
print(word_tokens_german)

['Hello', ',', 'I', 'am', 'Suraj', 'Khodade', '.', 'I', 'am', 'Tech', 'Software', 'developer', '!', 'Tech', 'Enthu', '.', 'Please', 'do', 'connect', 'with', 'me', 'on', 'LinkedIn', '.', 'It', "'s", 'a', 'great', 'platform', 'to', 'network', 'and', 'learn', '.', 'I', 'am', 'a', 'Data', 'Science', 'Enthusiast', '.']
['Hello', ',', 'I', 'am', 'Suraj', 'Khodade', '.', 'I', 'am', 'Tech', 'Software', 'developer', '!', 'Tech', 'Enthu', '.', 'Please', 'do', 'connect', 'with', 'me', 'on', 'LinkedIn', '.', 'It', "'s", 'a', 'great', 'platform', 'to', 'network', 'and', 'learn', '.', 'I', 'am', 'a', 'Data', 'Science', 'Enthusiast', '.']
['Hello', ',', 'I', 'am', 'Suraj', 'Khodade', '.', 'I', 'am', 'Tech', 'Software', 'developer', '!', 'Tech', 'Enthu', '.', 'Please', 'do', 'connect', 'with', 'me', 'on', 'LinkedIn', '.', 'It', "'s", 'a', 'great', 'platform', 'to', 'network', 'and', 'learn', '.', 'I', 'am', 'a', 'Data', 'Science', 'Enthusiast', '.']


In [21]:
## 3. Treebank Tokenizer (Penn Treebank conventions)
## Splits contractions and punctuation following Penn Treebank rules, e.g. "Don't" becomes "Do" + "n't".
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Don't hesitate to ask questions.")

print(tokens)

['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '.']
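
To make the contraction rule concrete, here is a simplified sketch of the idea using only the standard `re` module. It is not NLTK's implementation: `split_contractions` is a hypothetical helper that covers just `n't`, a handful of clitics, and basic punctuation, whereas the real Treebank rules handle many more cases (quotes, brackets, special contractions).

```python
import re

def split_contractions(sentence):
    """Rough Treebank-style split: detach n't, common clitics, and punctuation."""
    # "Don't" -> "Do n't", "can't" -> "ca n't"
    s = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence, flags=re.IGNORECASE)
    # "it's" -> "it 's", "they're" -> "they 're", and so on
    s = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", s, flags=re.IGNORECASE)
    # Surround basic punctuation with spaces so split() isolates it
    s = re.sub(r"([.,!?;:])", r" \1 ", s)
    return s.split()

print(split_contractions("Don't hesitate to ask questions."))
# ['Do', "n't", 'hesitate', 'to', 'ask', 'questions', '.']
print(split_contractions("They're sure it's fine."))
# ['They', "'re", 'sure', 'it', "'s", 'fine', '.']
```

Keeping `n't` as its own token lets later steps such as POS tagging treat the negation separately from the verb.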


In [22]:
## 4. WordPunct Tokenizer (Splits punctuation from words)
## Separates runs of alphanumeric characters from runs of punctuation, so contractions and email addresses are broken apart.
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize("Let's test: email@example.com!")

print(tokens)

['Let', "'", 's', 'test', ':', 'email', '@', 'example', '.', 'com', '!']
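
`WordPunctTokenizer` is a regex tokenizer with the fixed pattern `\w+|[^\w\s]+`: either a run of word characters or a run of non-word, non-space characters. The sketch below reproduces the output above with the standard `re` module; `word_punct_tokenize` is a hypothetical helper name used only for illustration.

```python
import re

# Either a run of word characters, or a run of characters that are neither word nor whitespace
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

def word_punct_tokenize(text):
    """Return all non-overlapping matches of the word/punctuation pattern."""
    return re.findall(WORD_PUNCT_PATTERN, text)

tokens = word_punct_tokenize("Let's test: email@example.com!")
print(tokens)
# ['Let', "'", 's', 'test', ':', 'email', '@', 'example', '.', 'com', '!']

# Consecutive punctuation characters stay together as one token
print(word_punct_tokenize("Wait..."))
# ['Wait', '...']
```

Because the pattern knows nothing about contractions or addresses, this tokenizer is predictable but often splits more aggressively than `word_tokenize`.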


In [25]:
## 5. Regex-Based Tokenization
## Custom tokenization using regular expressions.
from nltk.tokenize import regexp_tokenize

text = "This costs $1.99, and that costs $2.99."
tokens = regexp_tokenize(text, pattern=r'\$\d+\.\d+|\w+')
print(tokens)

['This', 'costs', '$1.99', 'and', 'that', 'costs', '$2.99']
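
For a pattern without capturing groups, `regexp_tokenize` returns the non-overlapping matches of the pattern, much like `re.findall`. The sketch below uses only the standard `re` module to show how the pattern drives the result. The `currency_pattern` that also accepts `₹` and whole-number amounts is an assumption added for illustration, not something taken from NLTK.

```python
import re

text = "This costs $1.99, and that costs $2.99."

# Same pattern as above: a dollar amount with decimals, or a run of word characters
price_tokens = re.findall(r'\$\d+\.\d+|\w+', text)
print(price_tokens)
# ['This', 'costs', '$1.99', 'and', 'that', 'costs', '$2.99']

# Illustrative variant: accept '$' or '₹', with an optional decimal part
currency_pattern = r'[$₹]\d+(?:\.\d+)?|\w+'
currency_tokens = re.findall(currency_pattern, "Price: $12.50 or ₹99")
print(currency_tokens)
# ['Price', '$12.50', 'or', '₹99']

# Alternative view: treat the pattern as the separator instead of the token
gap_tokens = [t for t in re.split(r'\s+', text) if t]
print(gap_tokens)
# ['This', 'costs', '$1.99,', 'and', 'that', 'costs', '$2.99.']
```

The last form corresponds to passing `gaps=True` to `RegexpTokenizer`, where the pattern describes what lies between tokens rather than the tokens themselves. Note that punctuation then stays attached to the neighboring word.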


#### Comparative Analysis of NLTK Tokenization Methods

| **Aspect**               | **Sentence Tokenization** (`sent_tokenize`) | **Word Tokenization** (`word_tokenize`) | **Treebank Tokenization** (`TreebankWordTokenizer`) | **Regex-Based Tokenization** (`RegexpTokenizer`)     |
| ------------------------ | ------------------------------------------- | --------------------------------------- | --------------------------------------------------- | ---------------------------------------------------- |
| **Purpose**              | Split paragraph into individual sentences   | Split sentence into words               | Precise word splitting based on Penn Treebank rules | Tokenize based on custom patterns                    |
| **Granularity**          | Sentence-level                              | Word-level                              | Word-level                                          | Pattern-level (e.g., words, symbols, entities)       |
| **Handles Punctuation**  | Retains sentence-ending punctuation         | Yes (e.g., `.`, `!`, `?`)               | Yes, intelligently separates contractions           | As per regex pattern                                 |
| **Customization**        | Minimal (language model-based)              | Minimal                                 | Minimal                                             | Fully customizable using regex                       |
| **Multilingual Support** | Yes (via Punkt model)                       | Limited                                 | English-centric                                     | Yes (regex-agnostic)                                 |
| **Syntax**               | `sent_tokenize(text)`                       | `word_tokenize(text)`                   | `TreebankWordTokenizer().tokenize(text)`            | `RegexpTokenizer(pattern).tokenize(text)`            |
| **Tokenizer Class**      | `PunktSentenceTokenizer`                    | `NLTKWordTokenizer` (used internally)   | `TreebankWordTokenizer`                             | `RegexpTokenizer`                                    |
| **Example Input**        | `"Dr. Smith went home. He slept."`          | `"Let's write Python!"`                 | `"Don't panic."`                                    | `"Price: $12.50 or ₹99"`                             |
| **Example Output**       | `["Dr. Smith went home.", "He slept."]`     | `["Let", "'s", "write", "Python", "!"]` | `["Do", "n't", "panic", "."]`                       | `["Price", "$12.50", "or", "₹99"]`                   |
| **Use Cases**            | Document parsing, summarization             | Preprocessing, feature extraction       | Linguistic analysis, POS tagging                    | Financial extraction, NER preprocessing, log parsing |
| **Performance**          | High (pretrained model)                     | High                                    | High                                                | Medium–High (regex complexity dependent)             |


#### Summary Recommendation

| Scenario                           | Recommended Tokenizer                      |
| ---------------------------------- | ------------------------------------------ |
| Sentence splitting                 | `sent_tokenize()`                          |
| General NLP preprocessing          | `word_tokenize()`                          |
| Linguistic precision (POS, syntax) | `TreebankWordTokenizer()`                  |
| Pattern-driven tasks (custom)      | `RegexpTokenizer()` or `regexp_tokenize()` |

##### Sentence Tokenization
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)

##### Word Tokenization
from nltk.tokenize import word_tokenize
words = word_tokenize(text)

##### Treebank Tokenization
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)

##### Regex Tokenization
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\$\d+\.\d+|\w+')
custom_tokens = tokenizer.tokenize(text)
