## Tokenization in NLP using NLTK
NLTK provides tokenization utilities for both word-level and sentence-level splitting. Most of them are rule-based or built on regular expressions, while sentence splitting relies on the pre-trained Punkt model.

### 🔍 Tokenization: Definition and Importance
📖 Definition:
Tokenization is the process of converting a sequence of text (such as a sentence or paragraph) into smaller units called tokens. These tokens can be:

- Words (word-level tokenization)
- Subwords (subword-level tokenization)
- Characters (character-level tokenization)

It is typically the first step in most NLP pipelines.
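
To make the three granularities concrete, here is a minimal sketch in plain Python. The helper `greedy_subwords` and the `toy_vocab` set are hypothetical and exist only for illustration; real subword tokenizers such as BPE or WordPiece learn their vocabulary from data.

```python
def greedy_subwords(word, vocab):
    """Greedy longest-match split; falls back to single characters."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


text = "tokenization matters"

# Word level: split on whitespace.
word_tokens = text.split()

# Subword level: split each word against a toy (assumed) vocabulary.
toy_vocab = {"token", "ization", "matter", "s"}
subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, toy_vocab)]

# Character level: every non-space character is a token.
char_tokens = list(text.replace(" ", ""))

print(word_tokens)     # ['tokenization', 'matters']
print(subword_tokens)  # ['token', 'ization', 'matter', 's']
print(char_tokens[:5]) # ['t', 'o', 'k', 'e', 'n']
```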

### 🧱 Why Tokenization is Important
| Purpose                 | Description                                                             |
| ----------------------- | ----------------------------------------------------------------------- |
| **Input Structuring**   | Converts raw, unstructured text into a structured format for ML models. |
| **Vocabulary Mapping**  | Helps build a vocabulary for embedding and model training.              |
| **Feature Extraction**  | Enables computation of word frequencies, TF-IDF scores, and embeddings. |
| **Contextual Analysis** | Necessary for POS tagging, NER, parsing, etc.                           |
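
As a rough illustration of the vocabulary-mapping and feature-extraction rows, the sketch below counts token frequencies and assigns integer ids. The sentence and the alphabetical id scheme are assumptions chosen for the example, not something NLTK prescribes.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

# Feature extraction: raw term frequencies.
freq = Counter(tokens)

# Vocabulary mapping: one integer id per distinct token (alphabetical order here).
vocab = {tok: idx for idx, tok in enumerate(sorted(freq))}

# Structured input: the text as a sequence of ids a model could consume.
ids = [vocab[tok] for tok in tokens]

print(freq["the"])  # 2
print(vocab)        # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(ids)          # [4, 0, 3, 2, 4, 1]
```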


### 🧱 Types of Tokenization in NLTK
| Type                           | Method/Class              | Description                                                                     | Use Case                                                               |
| ------------------------------ | ------------------------- | ------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **1. Word Tokenization**       | `word_tokenize()`         | Splits text into sentences first, then into words and punctuation using a Penn Treebank-style tokenizer. | Standard word-level tokenization for most NLP tasks.                   |
| **2. Sentence Tokenization**   | `sent_tokenize()`         | Splits a text into a list of sentences.                                         | Used in paragraph segmentation and sentence-level analysis.            |
| **3. Regex Tokenization**      | `RegexpTokenizer()`       | Tokenizes text using custom regular expressions.                                | For structured/customized token patterns (e.g., removing punctuation). |
| **4. Whitespace Tokenization** | `WhitespaceTokenizer()`   | Splits tokens based purely on whitespace.                                       | Useful when punctuation should be retained or for formatted data.      |
| **5. WordPunct Tokenization**  | `WordPunctTokenizer()`    | Splits text into runs of word characters and runs of punctuation.               | Granular tokenization; separates words from punctuation.               |
| **6. Treebank Tokenization**   | `TreebankWordTokenizer()` | Mimics Penn Treebank-style tokenization with specific rules.                    | For linguistically consistent and corpus-aligned NLP tasks.            |
| **7. Blankline Tokenization**  | `BlanklineTokenizer()`    | Splits paragraphs based on blank lines.                                         | When dealing with multiple paragraphs separated by empty lines.        |
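
Several of the tokenizers in this table are thin wrappers around a regular expression. The sketch below approximates three of them with the standard `re` module only; the patterns follow what the NLTK documentation describes, but this is an approximation for intuition, not the library's implementation. The `sample` string is made up for the example.

```python
import re

sample = "Hello! I work in A.I., and I love NLP.\n\nDon't you?"

# Comparable to WordPunctTokenizer: runs of word characters or runs of punctuation.
word_punct = re.findall(r"\w+|[^\w\s]+", sample)

# Comparable to WhitespaceTokenizer: split on whitespace only.
whitespace = sample.split()

# Comparable to BlanklineTokenizer: split wherever a blank line occurs.
paragraphs = re.split(r"\s*\n\s*\n\s*", sample)

print(word_punct)
print(whitespace)
print(paragraphs)
```

Note that the punctuation pattern keeps adjacent marks together, so `A.I.,` yields a single `.,` token rather than `.` and `,`.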


### Example

In [87]:
import nltk
from nltk.tokenize import (
    word_tokenize, sent_tokenize, RegexpTokenizer,
    WhitespaceTokenizer, WordPunctTokenizer,
    TreebankWordTokenizer, BlanklineTokenizer
)

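# Punkt models back word_tokenize() and sent_tokenize().
# Newer NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')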

text = "Hello! I'm Suraj. I work in A.I., and I love NLP. Don't you? Let's tokenize this text."

# 1. Word Tokenizer
word_tok = word_tokenize(text)

# 2. Sentence Tokenizer
sent_tok = sent_tokenize(text)

# 3. Regex Tokenizer (alphanumeric only)
regex_tok = RegexpTokenizer(r'\w+').tokenize(text)

# 4. Whitespace Tokenizer
whitespace_tok = WhitespaceTokenizer().tokenize(text)

# 5. WordPunct Tokenizer
word_punct_tok = WordPunctTokenizer().tokenize(text)

# 6. Treebank Word Tokenizer
treebank_tok = TreebankWordTokenizer().tokenize(text)

# 7. Blankline Tokenizer (used for multi-paragraph)
paragraph_text = "Hello! I'm Suraj.\n\nI work in A.I., and I love NLP.\n\nDon't you?"
blankline_tok = BlanklineTokenizer().tokenize(paragraph_text)

# Print results
print("1️⃣ Word Tokenizer:", word_tok)
print("2️⃣ Sentence Tokenizer:", sent_tok)
print("3️⃣ Regex Tokenizer:", regex_tok)
print("4️⃣ Whitespace Tokenizer:", whitespace_tok)
print("5️⃣ WordPunct Tokenizer:", word_punct_tok)
print("6️⃣ Treebank Word Tokenizer:", treebank_tok)
print("7️⃣ Blankline Tokenizer:", blankline_tok)


1️⃣ Word Tokenizer: ['Hello', '!', 'I', "'m", 'Suraj', '.', 'I', 'work', 'in', 'A.I.', ',', 'and', 'I', 'love', 'NLP', '.', 'Do', "n't", 'you', '?', 'Let', "'s", 'tokenize', 'this', 'text', '.']
2️⃣ Sentence Tokenizer: ['Hello!', "I'm Suraj.", 'I work in A.I., and I love NLP.', "Don't you?", "Let's tokenize this text."]
3️⃣ Regex Tokenizer: ['Hello', 'I', 'm', 'Suraj', 'I', 'work', 'in', 'A', 'I', 'and', 'I', 'love', 'NLP', 'Don', 't', 'you', 'Let', 's', 'tokenize', 'this', 'text']
4️⃣ Whitespace Tokenizer: ['Hello!', "I'm", 'Suraj.', 'I', 'work', 'in', 'A.I.,', 'and', 'I', 'love', 'NLP.', "Don't", 'you?', "Let's", 'tokenize', 'this', 'text.']
5️⃣ WordPunct Tokenizer: ['Hello', '!', 'I', "'", 'm', 'Suraj', '.', 'I', 'work', 'in', 'A', '.', 'I', '.,', 'and', 'I', 'love', 'NLP', '.', 'Don', "'", 't', 'you', '?', 'Let', "'", 's', 'tokenize', 'this', 'text', '.']
6️⃣ Treebank Word Tokenizer: ['Hello', '!', 'I', "'m", 'Suraj.', 'I', 'work', 'in', 'A.I.', ',', 'and', 'I', 'love', 'NLP.', 'Do



### 📊 Comparison of Outputs
| Tokenizer                      | Output                                                                                                                                                 | Notes                                                                    |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------ |
| **1. word\_tokenize()**        | `['Hello', '!', 'I', "'m", 'Suraj', '.', 'I', 'work', 'in', 'A.I.', ',', 'and', 'I', 'love', 'NLP', '.', 'Do', "n't", 'you', '?']`                     | Splits contractions (`Do` + `n't`), separates punctuation, keeps `A.I.` intact. |
| **2. sent\_tokenize()**        | `['Hello!', "I'm Suraj.", 'I work in A.I., and I love NLP.', "Don't you?"]`                                                                            | Segments text into sentences.                                            |
| **3. RegexpTokenizer(r'\w+')** | `['Hello', 'I', 'm', 'Suraj', 'I', 'work', 'in', 'A', 'I', 'and', 'I', 'love', 'NLP', 'Don', 't', 'you']`                                              | Removes punctuation; breaks contractions and abbreviations.              |
| **4. WhitespaceTokenizer()**   | `['Hello!', "I'm", 'Suraj.', 'I', 'work', 'in', 'A.I.,', 'and', 'I', 'love', 'NLP.', "Don't", 'you?']`                                                 | Preserves punctuation, splits only on spaces.                            |
| **5. WordPunctTokenizer()**    | `['Hello', '!', 'I', "'", 'm', 'Suraj', '.', 'I', 'work', 'in', 'A', '.', 'I', '.,', 'and', 'I', 'love', 'NLP', '.', 'Don', "'", 't', 'you', '?']` | Splits at every word/punctuation boundary, so contractions and abbreviations break apart; adjacent punctuation stays together (`.,`). |
| **6. TreebankWordTokenizer()** | `['Hello', '!', 'I', "'m", 'Suraj.', 'I', 'work', 'in', 'A.I.', ',', 'and', 'I', 'love', 'NLP.', 'Do', "n't", 'you', '?']`                             | Keeps sentence-internal periods attached (`Suraj.`, `NLP.`) because it expects one sentence at a time; `word_tokenize()` splits sentences first, so its output differs. |
| **7. BlanklineTokenizer()**    | `["Hello! I'm Suraj.", 'I work in A.I., and I love NLP.', "Don't you?"]`                                                                               | Splits text into paragraphs based on blank lines.                        |
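
The regex row above breaks contractions because `\w+` stops at the apostrophe. A slightly richer pattern can keep them whole. The sketch below uses the standard `re` module with an assumed pattern; the same pattern string could plausibly be passed to `RegexpTokenizer`, though that is not verified here.

```python
import re

text = "Don't you? Let's tokenize this text."

# Plain \w+ drops punctuation and splits contractions at the apostrophe.
plain = re.findall(r"\w+", text)

# Allow one apostrophe followed by more word characters inside a token.
keep_contractions = re.findall(r"\w+(?:'\w+)?", text)

print(plain)              # ['Don', 't', 'you', 'Let', 's', 'tokenize', 'this', 'text']
print(keep_contractions)  # ["Don't", 'you', "Let's", 'tokenize', 'this', 'text']
```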


In [70]:
corpus = "Hello ! I am Suraj Khodade.. It's my git repo for AIBootcamp. I am a Data Scientist, Machine Learning Engineer. I love to work on NLP projects. I am currently learning about tokenization in NLP using NLTK library."

In [71]:
corpus

"Hello ! I am Suraj Khodade.. It's my git repo for AIBootcamp. I am a Data Scientist, Machine Learning Engineer. I love to work on NLP projects. I am currently learning about tokenization in NLP using NLTK library."

In [72]:
print(corpus)

Hello ! I am Suraj Khodade.. It's my git repo for AIBootcamp. I am a Data Scientist, Machine Learning Engineer. I love to work on NLP projects. I am currently learning about tokenization in NLP using NLTK library.


In [73]:
from nltk.tokenize import sent_tokenize

In [74]:
document = sent_tokenize(corpus, language='english')
document

['Hello !',
 'I am Suraj Khodade..',
 "It's my git repo for AIBootcamp.",
 'I am a Data Scientist, Machine Learning Engineer.',
 'I love to work on NLP projects.',
 'I am currently learning about tokenization in NLP using NLTK library.']

In [75]:
type(document)

list

In [76]:
for sentence in document:
    print(sentence)

Hello !
I am Suraj Khodade..
It's my git repo for AIBootcamp.
I am a Data Scientist, Machine Learning Engineer.
I love to work on NLP projects.
I am currently learning about tokenization in NLP using NLTK library.
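
To see why a trained sentence splitter is useful, compare it with a naive rule that splits after every `.`, `!` or `?` followed by whitespace. The helper `naive_sentence_split` and the sample text are hypothetical, written with the standard library only; `sent_tokenize` relies on the Punkt model, which is designed to cope with abbreviations like the one below.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Dr. Smith works on NLP. He likes it!"
print(naive_sentence_split(sample))
# ['Dr.', 'Smith works on NLP.', 'He likes it!']
```

The abbreviation `Dr.` is wrongly treated as a sentence boundary, which is the kind of error a purely punctuation-based rule cannot avoid.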


In [77]:
## word tokenization
from nltk.tokenize import word_tokenize

In [78]:
word_document = word_tokenize(corpus, language='english')
word_document

['Hello',
 '!',
 'I',
 'am',
 'Suraj',
 'Khodade',
 '..',
 'It',
 "'s",
 'my',
 'git',
 'repo',
 'for',
 'AIBootcamp',
 '.',
 'I',
 'am',
 'a',
 'Data',
 'Scientist',
 ',',
 'Machine',
 'Learning',
 'Engineer',
 '.',
 'I',
 'love',
 'to',
 'work',
 'on',
 'NLP',
 'projects',
 '.',
 'I',
 'am',
 'currently',
 'learning',
 'about',
 'tokenization',
 'in',
 'NLP',
 'using',
 'NLTK',
 'library',
 '.']

In [79]:
# Tokenize each sentence separately to get one list of words per sentence
word_sentence = [word_tokenize(sentence, language='english') for sentence in document]

word_sentence



[['Hello', '!'],
 ['I', 'am', 'Suraj', 'Khodade', '..'],
 ['It', "'s", 'my', 'git', 'repo', 'for', 'AIBootcamp', '.'],
 ['I',
  'am',
  'a',
  'Data',
  'Scientist',
  ',',
  'Machine',
  'Learning',
  'Engineer',
  '.'],
 ['I', 'love', 'to', 'work', 'on', 'NLP', 'projects', '.'],
 ['I',
  'am',
  'currently',
  'learning',
  'about',
  'tokenization',
  'in',
  'NLP',
  'using',
  'NLTK',
  'library',
  '.']]
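
A nested list like this keeps sentence boundaries, which a flat token list loses. The sketch below shows two common follow-up steps, per-sentence lengths and flattening, using the first three sentences from the output above copied in as literal data so it runs without NLTK.

```python
from itertools import chain

# First three sentences from the output above, copied as literal data.
word_sentence = [
    ['Hello', '!'],
    ['I', 'am', 'Suraj', 'Khodade', '..'],
    ['It', "'s", 'my', 'git', 'repo', 'for', 'AIBootcamp', '.'],
]

# Tokens per sentence, handy for length statistics or padding decisions.
lengths = [len(sentence) for sentence in word_sentence]

# Flatten back into a single token stream when boundaries are not needed.
flat_tokens = list(chain.from_iterable(word_sentence))

print(lengths)           # [2, 5, 8]
print(len(flat_tokens))  # 15
```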

In [80]:
for word in word_document:
    print(word)

Hello
!
I
am
Suraj
Khodade
..
It
's
my
git
repo
for
AIBootcamp
.
I
am
a
Data
Scientist
,
Machine
Learning
Engineer
.
I
love
to
work
on
NLP
projects
.
I
am
currently
learning
about
tokenization
in
NLP
using
NLTK
library
.


In [81]:
## word punctuation tokenization
## used to split words and punctuation
from nltk.tokenize import wordpunct_tokenize

In [82]:
word_document = wordpunct_tokenize(corpus)
word_document

['Hello',
 '!',
 'I',
 'am',
 'Suraj',
 'Khodade',
 '..',
 'It',
 "'",
 's',
 'my',
 'git',
 'repo',
 'for',
 'AIBootcamp',
 '.',
 'I',
 'am',
 'a',
 'Data',
 'Scientist',
 ',',
 'Machine',
 'Learning',
 'Engineer',
 '.',
 'I',
 'love',
 'to',
 'work',
 'on',
 'NLP',
 'projects',
 '.',
 'I',
 'am',
 'currently',
 'learning',
 'about',
 'tokenization',
 'in',
 'NLP',
 'using',
 'NLTK',
 'library',
 '.']

In [83]:
from nltk.tokenize import TreebankWordTokenizer

## TreebankWordTokenizer applies Penn Treebank conventions.
## Unlike wordpunct_tokenize, it splits "It's" as "It" + "'s" rather than "It" + "'" + "s".
## It assumes the input is a single sentence, so only the final full stop becomes a separate token;
## full stops inside the text stay attached to the preceding word (e.g. 'AIBootcamp.').

In [84]:
word_treebank = TreebankWordTokenizer()
word_treebank.tokenize(corpus)


['Hello',
 '!',
 'I',
 'am',
 'Suraj',
 'Khodade..',
 'It',
 "'s",
 'my',
 'git',
 'repo',
 'for',
 'AIBootcamp.',
 'I',
 'am',
 'a',
 'Data',
 'Scientist',
 ',',
 'Machine',
 'Learning',
 'Engineer.',
 'I',
 'love',
 'to',
 'work',
 'on',
 'NLP',
 'projects.',
 'I',
 'am',
 'currently',
 'learning',
 'about',
 'tokenization',
 'in',
 'NLP',
 'using',
 'NLTK',
 'library',
 '.']
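
The attached periods in this output (`'AIBootcamp.'`, `'projects.'`) follow from a rule that detaches a period only at the very end of the input. The sketch below imitates that single rule with the standard `re` module; `split_final_period` is a hypothetical helper and a heavy simplification of the Treebank rules, not NLTK's code.

```python
import re


def split_final_period(text):
    """Detach only a period that ends the whole string, then split on whitespace."""
    return re.sub(r"([^.])(\.)\s*$", r"\1 \2", text).split()


sample = "I love NLP projects. I use the NLTK library."
print(split_final_period(sample))
# ['I', 'love', 'NLP', 'projects.', 'I', 'use', 'the', 'NLTK', 'library', '.']
```

Running a sentence splitter first and tokenizing each sentence separately is one way to get every sentence-final period as its own token.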