Q1) Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK
library. Use the Porter stemmer and the Snowball stemmer for stemming. Use any technique for
lemmatization.

In [None]:
!pip install nltk



In [None]:
import nltk
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer, TweetTokenizer
from nltk.tokenize.mwe import MWETokenizer
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")  # Download WordNet data for lemmatization

# Sample Text
text = "Alice and Bob went to the enchanted forest, where they found a mysterious treasure chest! #AdventureTime"

# Tokenization Techniques
whitespace_tokenizer = WhitespaceTokenizer()
word_punct_tokenizer = WordPunctTokenizer()
treebank_tokenizer = TreebankWordTokenizer()
tweet_tokenizer = TweetTokenizer()

# Multi-word expression tokenizer
mwe_tokenizer = MWETokenizer([("Alice", "and", "Bob"), ("mysterious", "treasure", "chest")])

# Apply Tokenization
whitespace_tokens = whitespace_tokenizer.tokenize(text)
word_punct_tokens = word_punct_tokenizer.tokenize(text)
treebank_tokens = treebank_tokenizer.tokenize(text)
tweet_tokens = tweet_tokenizer.tokenize(text)
mwe_tokens = mwe_tokenizer.tokenize(treebank_tokens)  # Apply MWE on Treebank tokens

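# Stemming
# Both stemmers lowercase their input by default and strip suffixes by rule,
# so the results are not always dictionary words (e.g. "Alice" -> "alic").
porter = PorterStemmer()
snowball = SnowballStemmer("english")

porter_stems = [porter.stem(word) for word in treebank_tokens]
snowball_stems = [snowball.stem(word) for word in treebank_tokens]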

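# Lemmatization
# lemmatize() treats every word as a noun by default (pos="n"), so verb forms
# are generally left as-is unless a part-of-speech tag such as pos="v" is passed.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in treebank_tokens]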

# Display Results
print("\nTokenization")
print("Whitespace Tokenizer:", whitespace_tokens)
print("Punctuation-based Tokenizer:", word_punct_tokens)
print("Treebank Tokenizer:", treebank_tokens)
print("Tweet Tokenizer:", tweet_tokens)
print("Multi-word Expression (MWE) Tokenizer:", mwe_tokens)

print("\nStemming")
print("Porter Stemmer:", porter_stems)
print("Snowball Stemmer:", snowball_stems)

print("\nLemmatization")
print("WordNet Lemmatizer:", lemmas)



Tokenization
Whitespace Tokenizer: ['Alice', 'and', 'Bob', 'went', 'to', 'the', 'enchanted', 'forest,', 'where', 'they', 'found', 'a', 'mysterious', 'treasure', 'chest!', '#AdventureTime']
Punctuation-based Tokenizer: ['Alice', 'and', 'Bob', 'went', 'to', 'the', 'enchanted', 'forest', ',', 'where', 'they', 'found', 'a', 'mysterious', 'treasure', 'chest', '!', '#', 'AdventureTime']
Treebank Tokenizer: ['Alice', 'and', 'Bob', 'went', 'to', 'the', 'enchanted', 'forest', ',', 'where', 'they', 'found', 'a', 'mysterious', 'treasure', 'chest', '!', '#', 'AdventureTime']
Tweet Tokenizer: ['Alice', 'and', 'Bob', 'went', 'to', 'the', 'enchanted', 'forest', ',', 'where', 'they', 'found', 'a', 'mysterious', 'treasure', 'chest', '!', '#AdventureTime']
Multi-word Expression (MWE) Tokenizer: ['Alice_and_Bob', 'went', 'to', 'the', 'enchanted', 'forest', ',', 'where', 'they', 'found', 'a', 'mysterious_treasure_chest', '!', '#', 'AdventureTime']

Stemming
Porter Stemmer: ['alic', 'and', 'bob', 'went', 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
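
To make the differences in the tokenizer outputs above easier to see, here is a minimal pure-Python sketch of three of the ideas: splitting on whitespace, splitting into runs of word characters and runs of punctuation (the pattern `\w+|[^\w\s]+` is the one documented for `WordPunctTokenizer`), and merging multi-word expressions. The function names `whitespace_tokenize`, `word_punct_tokenize` and `merge_mwes` are hypothetical helpers written for illustration. They are not part of NLTK, and `merge_mwes` is a simple greedy longest-match approximation rather than NLTK's actual `MWETokenizer` implementation.

```python
import re


def whitespace_tokenize(text):
    # Split on runs of whitespace; punctuation stays attached to its word.
    return text.split()


def word_punct_tokenize(text):
    # Either a run of word characters or a run of non-word, non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)


def merge_mwes(tokens, mwes, separator="_"):
    # Illustrative greedy merge (assumption: longest expression wins at each position).
    # Empty expressions are dropped so the loop always advances.
    mwes = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in mwes:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts here, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


sample = "Alice and Bob found a treasure chest!"
ws_tokens = whitespace_tokenize(sample)
wp_tokens = word_punct_tokenize(sample)
merged_tokens = merge_mwes(wp_tokens, [("Alice", "and", "Bob"), ("treasure", "chest")])

print("Whitespace:", ws_tokens)
print("Word/punct:", wp_tokens)
print("MWE merged:", merged_tokens)
```

As in the NLTK cell above, the merge step runs on an already tokenized list, so the choice of the first tokenizer decides whether an expression can match at all: on the whitespace tokens, "chest!" would not match "chest".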
