
In [None]:
import nltk

# Download required NLTK resources
nltk.download("punkt")  # Punkt models used by sent_tokenize/word_tokenize in older NLTK releases
nltk.download("punkt_tab")  # Punkt data in tabular form; newer NLTK releases load this instead of "punkt"

# Sample text (corpus)
corpus = """Hello Welcome, to Kiran learning tutorial.
Please do watch the entire course! to become expert in NLP."""

# ------------------------------------------------------------
# 1. Sentence Tokenization
# Splits a paragraph/text into individual sentences.
# Uses "PunktSentenceTokenizer" under the hood.
documents = nltk.sent_tokenize(corpus)
print("Sentence Tokenization:")
print(documents)
print("-" * 50)

# ------------------------------------------------------------
# 2. Word Tokenization
# Splits text into word tokens; punctuation marks become separate tokens.
# Internally splits the text into sentences first, then tokenizes each sentence.
words = nltk.word_tokenize(corpus)
print("Word Tokenization (with punctuation):")
print(words)
print("-" * 50)

# ------------------------------------------------------------
# 3. Tokenize each sentence into words
print("Sentence to Word Tokenization:")
for sentence in documents:
    print(nltk.word_tokenize(sentence))
print("-" * 50)

# ------------------------------------------------------------
# 4. WordPunctTokenizer
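# Regex-based: every run of word characters and every run of punctuation
# becomes its own token, so contractions are broken at the apostrophe.
# Example: "don't" → ["don", "'", "t"], "NLP." → ["NLP", "."]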
words2 = nltk.wordpunct_tokenize(corpus)
print("WordPunct Tokenization (splits punctuation separately):")
print(words2)
print("-" * 50)

# ------------------------------------------------------------
# 5. TreebankWordTokenizer
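# Follows Penn Treebank conventions: keeps contractions meaningful and splits most punctuation.
# Example: "don't" → ["do", "n't"], "NLP." → ["NLP", "."]
# Note: it expects one sentence at a time. Only a period at the very end of the
# input is split off, so a mid-text period stays attached (e.g. "tutorial.").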
treebank_tokenizer = nltk.TreebankWordTokenizer()
words3 = treebank_tokenizer.tokenize(corpus)
print("Treebank Tokenization (handles contractions smartly):")
print(words3)
print("-" * 50)

# ------------------------------------------------------------
# 6. TreebankWordDetokenizer
# Reverses tokenization: joins a list of tokens back into a single string.
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenizer = TreebankWordDetokenizer()
print("Detokenization (joining tokens back):")
print(detokenizer.detokenize(words3))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Sentence Tokenization:
['Hello Welcome, to Kiran learning tutorial.', 'Please do watch the entire course!', 'to become expert in NLP.']
--------------------------------------------------
Word Tokenization (with punctuation):
['Hello', 'Welcome', ',', 'to', 'Kiran', 'learning', 'tutorial', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', '!', 'to', 'become', 'expert', 'in', 'NLP', '.']
--------------------------------------------------
Sentence to Word Tokenization:
['Hello', 'Welcome', ',', 'to', 'Kiran', 'learning', 'tutorial', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']
--------------------------------------------------
WordPunct Tokenization (splits punctuation separately):
['Hello', 'Welcome', ',', 'to', 'Kiran', 'learning', 'tutorial', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', '!', 'to', 'become', 'expert', 'in', 'NLP', '.']
--------------------------------------------------
Treebank Tokenizatio

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
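
The rule behind `nltk.wordpunct_tokenize` (step 4) can be approximated with the standard library alone. The sketch below assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which is the pattern NLTK documents for `WordPunctTokenizer`. The helper name `simple_wordpunct` and the sample sentence are made up for illustration and are not part of the notebook above.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_wordpunct(text):
    """Hypothetical helper: approximate wordpunct-style tokenization."""
    return WORDPUNCT_PATTERN.findall(text)


sample = "Don't stop, NLP."

tokens = simple_wordpunct(sample)
print(tokens)
# ['Don', "'", 't', 'stop', ',', 'NLP', '.']

# For comparison, plain whitespace splitting leaves punctuation attached.
print(sample.split())
# ["Don't", 'stop,', 'NLP.']
```

The contraction is cut at the apostrophe, which is the main practical difference from the Treebank-style tokenizers shown in steps 2 and 5.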

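
The Treebank tokenizer (step 5) splits off a period only at the very end of its input, so its result depends on whether it receives the whole text or one sentence at a time. A small sketch of that difference, assuming NLTK is installed (`TreebankWordTokenizer` itself needs no data download). The sample sentences are invented and split by hand here instead of with `sent_tokenize`.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Hand-split sample sentences (illustrative, not from the notebook's corpus).
sentences = ["I don't know.", "Ask Kiran."]

# Whole text at once: the period after "know" is not at the end of the
# input, so it stays attached to the word.
whole_text_tokens = tokenizer.tokenize(" ".join(sentences))
print(whole_text_tokens)
# ['I', 'do', "n't", 'know.', 'Ask', 'Kiran', '.']

# One sentence at a time: every sentence-final period becomes its own token.
per_sentence_tokens = [tokenizer.tokenize(s) for s in sentences]
print(per_sentence_tokens)
# [['I', 'do', "n't", 'know', '.'], ['Ask', 'Kiran', '.']]
```

Feeding the tokenizer the output of `nltk.sent_tokenize`, as step 3 does with `word_tokenize`, is one way to get consistent handling of sentence-final periods.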