### Tokenization

Tokenization is the process of breaking text into smaller units called tokens.

Tokens are the building blocks that NLP models work with - they can be words, subwords, or even characters, depending on the task.
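
As a rough illustration of token granularity, the sketch below splits the same string into word-level and character-level tokens using only built-in string operations. The sample string is made up for this example; subword tokens are not shown because they need a trained vocabulary.

```python
text = "NLP is fun"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)  # ['NLP', 'is', 'fun']
print(char_tokens)  # ['N', 'L', 'P', 'i', 's', 'f', 'u', 'n']
```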

### Why we need Tokenization

Computers can’t directly "understand" raw text the way humans do.

We break text into tokens so we can:

- Analyze word frequency
- Build vocabularies
- Feed data into models
- Do tasks like stemming, lemmatization, and sentiment analysis
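
A minimal sketch of the first two uses, assuming a made-up sample sentence and plain whitespace splitting in place of a real tokenizer:

```python
from collections import Counter

# Hypothetical sample text; any tokenizer output (a list of strings) would work here.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

# Word frequency: count how often each token occurs.
frequency = Counter(tokens)

# Vocabulary: the set of distinct tokens, sorted for a stable order.
vocabulary = sorted(set(tokens))

print(frequency.most_common(2))  # [('the', 3), ('sat', 2)]
print(vocabulary)
```

Once text is a list of tokens, both tasks reduce to ordinary operations on Python collections.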

In [8]:
import nltk
nltk.download('punkt')      # Punkt models used by sent_tokenize and word_tokenize
nltk.download('punkt_tab')  # tabular Punkt data required by newer NLTK releases

[nltk_data] Downloading package punkt to /Users/shyamsonu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/shyamsonu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [9]:
corpus="""Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

In [10]:
print(corpus)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



### sent_tokenize in NLTK
sent_tokenize() is a sentence tokenizer function in the NLTK (Natural Language Toolkit) library.
It splits a paragraph or large text into individual sentences using a pre-trained Punkt tokenizer model.

In [13]:
## paragraph --> sentences
from nltk.tokenize import sent_tokenize

In [14]:
documents=sent_tokenize(corpus)

In [15]:
type(documents)

list

In [16]:
for sentence in documents:
    print(sentence)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.


### word_tokenize 

word_tokenize() is a function in NLTK that splits text into individual words and punctuation marks - collectively called tokens. It keeps clitics such as 's attached to their apostrophe, so "Naik's" becomes "Naik" and "'s".

In [17]:
## paragraph --> words
## sentence --> words
from nltk.tokenize import word_tokenize

In [18]:
print(corpus)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



In [19]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [20]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'Krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


### wordpunct_tokenize in NLTK
wordpunct_tokenize() is a tokenizer in the NLTK library that splits text into tokens by separating all punctuation from words. It breaks the text into runs of word characters (letters, digits, and underscores) and runs of punctuation characters.

#### How it works
- Splits text on whitespace and at every boundary between word characters and punctuation.
- Treats punctuation marks (like ., ,, !, ?) as separate tokens; consecutive punctuation marks stay together in one token.
- Unlike word_tokenize(), it does not specially handle contractions or abbreviations, so "Naik's" becomes "Naik", "'", "s".
- In short, it separates words from punctuation without applying linguistic rules.
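
The behaviour described above can be approximated with a single regular expression. This is a sketch, assuming the pattern `\w+|[^\w\s]+` (the one NLTK's WordPunctTokenizer is built on); `wordpunct_sketch` is a hypothetical helper name used only for illustration.

```python
import re

# Assumed pattern: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Return word and punctuation tokens without any linguistic rules."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("Krish Naik's NLP Tutorials."))
# ['Krish', 'Naik', "'", 's', 'NLP', 'Tutorials', '.']

print(wordpunct_sketch("Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```

The second call shows that adjacent punctuation characters are grouped into one token rather than split one by one.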

In [21]:
from nltk.tokenize import wordpunct_tokenize

In [22]:
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

### TreebankWordTokenizer in NLTK
TreebankWordTokenizer is a word tokenizer in the NLTK library modeled after the tokenization conventions used in the Penn Treebank corpus, a widely used annotated English text corpus.

It applies a fixed set of regular-expression rules for English and is best suited to formal written text. It expects its input to be a single sentence.

#### Key Features
- Splits standard contractions into their component parts, e.g.:
  - "can't" → ["ca", "n't"]
  - "I'm" → ["I", "'m"]
- Separates punctuation marks like commas, parentheses, and quotes as individual tokens. A period is split off only at the end of the input string.
- Handles common English tokenization quirks, such as:
  - Keeping commas inside numbers (1,000 → ["1,000"]); a comma is split off only when it is not followed by a digit
  - Splitting a few fused forms into their parts ("cannot" → ["can", "not"])
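
A quick way to check the contraction rules is to run the tokenizer on a few short inputs. The example sentences below are made up for illustration; TreebankWordTokenizer is rule-based, so this sketch should not need any downloaded NLTK data.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Hypothetical one-sentence inputs, each exercising one rule.
examples = [
    "I can't go.",      # negative contraction
    "I'm here.",        # clitic 'm
    "We cannot stay.",  # fused form
]

results = [tokenizer.tokenize(text) for text in examples]
for tokens in results:
    print(tokens)
# ['I', 'ca', "n't", 'go', '.']
# ['I', "'m", 'here', '.']
# ['We', 'can', 'not', 'stay', '.']
```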

In [23]:
from nltk.tokenize import TreebankWordTokenizer

In [24]:
tokenizer=TreebankWordTokenizer()

In [25]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
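
Note that 'Tutorials.' keeps its period in the output above, while the final '.' is separated. This is consistent with the tokenizer splitting a period only at the end of its input, which is why it is usually applied to one sentence at a time. A minimal sketch of that approach follows; the `sentences` list is hard-coded from the sent_tokenize output shown earlier so the snippet runs without the Punkt data, and in the notebook the `documents` list could be used instead.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Assumption: copied from the sent_tokenize(corpus) output earlier in this notebook.
sentences = [
    "Hello Welcome,to Krish Naik's NLP Tutorials.",
    "Please do watch the entire course!",
    "to become expert in NLP.",
]

# Tokenize each sentence separately so every sentence-final period is split off.
tokens_per_sentence = [tokenizer.tokenize(sentence) for sentence in sentences]
for tokens in tokens_per_sentence:
    print(tokens)
# ['Hello', 'Welcome', ',', 'to', 'Krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
# ['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
# ['to', 'become', 'expert', 'in', 'NLP', '.']
```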