# **What is Tokenization?**

Tokenization is typically the first step in a Natural Language Processing (NLP) pipeline: it segments a continuous stream of text into smaller, meaningful units called "tokens." These tokens can be individual words, numbers, punctuation marks, or subword units, depending on the tokenizer and the task. The goal is to turn raw text into a structured sequence that algorithms can process, which lays the groundwork for later steps such as parsing, text analysis, and training machine learning models.
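
To make the idea of token granularity concrete, here is a minimal sketch that uses only the Python standard library (no NLTK). The sample sentence and variable names are illustrative assumptions; the regular expression is a simple stand-in, not how any particular library tokenizes.

```python
import re

text = "Tokenization isn't hard, is it?"

# Whitespace tokens: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Word/punctuation tokens: runs of word characters, or single non-space symbols.
word_punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokens: the finest granularity, one token per non-space character.
char_tokens = [ch for ch in text if not ch.isspace()]

print(whitespace_tokens)
# ['Tokenization', "isn't", 'hard,', 'is', 'it?']
print(word_punct_tokens)
# ['Tokenization', 'isn', "'", 't', 'hard', ',', 'is', 'it', '?']
print(len(char_tokens))
```

The same sentence yields 5, 9, or 27 tokens depending on the rule, which is why the choice of tokenizer matters for everything downstream.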

In [1]:
!pip install -q nltk

In [15]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # Added to resolve the LookupError

print("### Tokenization Explained Step-by-Step ###")

# Tokenization is the process of breaking down a text into smaller units called tokens.
# These tokens can be words, phrases, or even whole sentences, depending on the type of tokenization.
# It's a fundamental step in most NLP tasks.


## Example 1: Sentence Tokenization
# Goal: Divide a paragraph or document into individual sentences.
# NLTK's `sent_tokenize` is excellent for this.

print("\n--- Example 1: Sentence Tokenization ---")
# Define a new example corpus for sentence tokenization
sentence_corpus = """Natural Language Processing is an exciting field. It combines computer science, artificial intelligence, and linguistics. Many applications benefit from NLP, such as machine translation and sentiment analysis."""

print("Original Text:")
print(sentence_corpus)

# Step 1: Import the necessary function
from nltk.tokenize import sent_tokenize

# Step 2: Apply sentence tokenization
sentences = sent_tokenize(sentence_corpus)

# Step 3: Print and explain the output
print("\nTokenized Sentences:")
for i, sent in enumerate(sentences):
    print(f"Sentence {i+1}: '{sent}'")
print("Explanation: `sent_tokenize` effectively identifies sentence boundaries, often based on punctuation like periods, question marks, and exclamation marks. Each string in the output list represents a complete sentence.")


## Example 2: Word Tokenization (Standard)
# Goal: Divide a sentence or text into individual words.
# NLTK's `word_tokenize` is a common and versatile choice.

print("\n--- Example 2: Word Tokenization (Standard) ---")
# Define a new example sentence for word tokenization
word_corpus_1 = "I love learning about NLP! It's truly fascinating."

print("Original Text:")
print(word_corpus_1)

# Step 1: Import the necessary function
from nltk.tokenize import word_tokenize

# Step 2: Apply word tokenization
words_1 = word_tokenize(word_corpus_1)

# Step 3: Print and explain the output
print("\nTokenized Words:")
print(words_1)
print("Explanation: `word_tokenize` splits the text into words and punctuation marks. Notice how 'NLP!' is split into 'NLP' and '!', and \"It's\" is split into 'It' and \"'s\". Contractions and trailing punctuation become separate tokens.")


## Example 3: Word Tokenization (TreebankWordTokenizer)
# Goal: Divide a sentence into words using a rule-based tokenizer that follows Penn Treebank conventions.
# It applies regular-expression rules (it is not a trained model) and expects one sentence at a time; `word_tokenize` builds on a closely related tokenizer after first splitting the text into sentences.

print("\n--- Example 3: Word Tokenization (TreebankWordTokenizer) ---")
# Define an example sentence to highlight Treebank specific tokenization
word_corpus_2 = "Don't forget to 'code' in Colab; it's so much fun!"

print("Original Text:")
print(word_corpus_2)

# Step 1: Import and instantiate the tokenizer
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()

# Step 2: Apply Treebank word tokenization
words_2 = treebank_tokenizer.tokenize(word_corpus_2)

# Step 3: Print and explain the output
print("\nTokenized Words (Treebank):")
print(words_2)
print("Explanation: The `TreebankWordTokenizer` applies rule-based conventions from the Penn Treebank project. It splits contractions ('Don't' -> ['Do', \"n't\"], 'it's' -> ['it', \"'s\"]) and separates most punctuation into its own tokens, which suits parsers built for Treebank-style data. Unlike `word_tokenize`, it does not split the text into sentences first, so it assumes the input is a single sentence and only separates a period at the very end of the string.")

### Tokenization Explained Step-by-Step ###

--- Example 1: Sentence Tokenization ---
Original Text:
Natural Language Processing is an exciting field. It combines computer science, artificial intelligence, and linguistics. Many applications benefit from NLP, such as machine translation and sentiment analysis.

Tokenized Sentences:
Sentence 1: 'Natural Language Processing is an exciting field.'
Sentence 2: 'It combines computer science, artificial intelligence, and linguistics.'
Sentence 3: 'Many applications benefit from NLP, such as machine translation and sentiment analysis.'
Explanation: `sent_tokenize` effectively identifies sentence boundaries, often based on punctuation like periods, question marks, and exclamation marks. Each string in the output list represents a complete sentence.

--- Example 2: Word Tokenization (Standard) ---
Original Text:
I love learning about NLP! It's truly fascinating.

Tokenized Words:
['I', 'love', 'learning', 'about', 'NLP', '!', 'It', "'s", 'truly'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [2]:
corpus="""Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

In [3]:
print(corpus)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



In [4]:
##  Tokenization
## Paragraph-->sentences
from nltk.tokenize import sent_tokenize

In [12]:
documents=sent_tokenize(corpus)

In [13]:
type(documents)

list

In [14]:
for sentence in documents:
    print(sentence)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.
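
Note that `sent_tokenize` broke the text after "course!" even though the next word is lowercase. A rough way to see what is going on is to split after sentence-ending punctuation with a regular expression. The sketch below is an assumption for illustration only: `naive_sent_split` is a hypothetical helper, and NLTK's Punkt-based tokenizer is more sophisticated (for example, it typically recognizes common abbreviations).

```python
import re

corpus = """Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

for sentence in naive_sent_split(corpus):
    print(sentence)
# Hello Welcome,to Krish Naik's NLP Tutorials.
# Please do watch the entire course!
# to become expert in NLP.

# The same rule wrongly breaks after an abbreviation:
print(naive_sent_split("Dr. Smith teaches NLP. He is great."))
# ['Dr.', 'Smith teaches NLP.', 'He is great.']
```

The second call shows why a plain punctuation rule is not enough for real text and why a dedicated sentence tokenizer is usually the better choice.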


In [None]:
## Tokenization
## Paragraph-->words
## sentence--->words
from nltk.tokenize import word_tokenize

In [None]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [None]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'Krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


In [None]:
from nltk.tokenize import wordpunct_tokenize

In [None]:
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
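
`wordpunct_tokenize` splits "Naik's" into three tokens because it is purely pattern-based: every run of word characters and every run of punctuation becomes its own token. To my knowledge NLTK implements it with the regular expression `\w+|[^\w\s]+`; the sketch below applies that pattern directly with `re.findall`. Treat the pattern and the helper name `wordpunct_sketch` as assumptions rather than NLTK's actual source.

```python
import re

# Pattern assumed to mirror NLTK's WordPunctTokenizer: word runs, or punctuation runs.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Return all word runs and punctuation runs in order (hypothetical helper)."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("Krish Naik's NLP Tutorials."))
# ['Krish', 'Naik', "'", 's', 'NLP', 'Tutorials', '.']

print(wordpunct_sketch("Wait... really?!"))
# ['Wait', '...', 'really', '?!']
```

Because the pattern knows nothing about English, the possessive `'s` is broken into `'` and `s`, whereas `word_tokenize` keeps it as the single token `'s`.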

In [None]:
from nltk.tokenize import TreebankWordTokenizer

In [None]:
tokenizer=TreebankWordTokenizer()

In [None]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
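
In the output above, 'Tutorials.' keeps its period while the final 'NLP' and '.' are separated. The Treebank tokenizer expects one sentence at a time and only splits a period at the very end of its input. The sketch below imitates that behavior with a few regular expressions. It is a heavily simplified assumption for illustration (`treebank_like` is a hypothetical helper, not NLTK's implementation, and it covers only a handful of rules).

```python
import re

corpus = """Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

def treebank_like(text):
    """Hypothetical helper: imitates a few Treebank-style rules, not NLTK's code."""
    text = re.sub(r"n't\b", " n't", text)        # Don't -> Do n't
    text = re.sub(r"'s\b", " 's", text)          # Naik's -> Naik 's
    text = re.sub(r"([,;!?])", r" \1 ", text)    # isolate common punctuation
    text = re.sub(r"\.\s*$", " .", text)         # split a period only at the very end
    return text.split()

# Whole paragraph at once: the mid-text period stays attached.
print(treebank_like(corpus)[7:10])
# ['NLP', 'Tutorials.', 'Please']

# One sentence at a time: the period is now at the end, so it is split off.
print(treebank_like("Hello Welcome,to Krish Naik's NLP Tutorials.")[-2:])
# ['Tutorials', '.']

print(treebank_like("Don't forget; it's fun!"))
# ['Do', "n't", 'forget', ';', 'it', "'s", 'fun', '!']
```

In practice this means running `sent_tokenize` first and passing each sentence to the Treebank tokenizer, rather than handing it a whole paragraph.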