**Perform tokenization (whitespace, punctuation-based, Treebank, tweet, MWE) using the NLTK library. Use the Porter stemmer and the Snowball stemmer for stemming. Use any technique for lemmatization.**

In [None]:
!pip install nltk




In [None]:
import nltk
from nltk.tokenize import (WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer, TweetTokenizer, MWETokenizer)
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


In [None]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
text = "Hello there! This is the first assignment of NLP ."

**Tokenization: Tokenization is the process of breaking text into smaller units (tokens) such as words or phrases.**

**Types:**

**1. Whitespace Tokenization: Splits text based on spaces.**

In [None]:
# 1. Tokenization
print("\n--- Tokenization ---")

# Whitespace-based Tokenization
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print("Whitespace Tokenization:", whitespace_tokens)


--- Tokenization ---
Whitespace Tokenization: ['Hello', 'there!', 'This', 'is', 'the', 'first', 'assignment', 'of', 'NLP', '.']


**2. Punctuation-Based Tokenization (WordPunctTokenizer): Splits text into runs of alphanumeric characters and runs of punctuation, discarding whitespace.**

In [None]:
# Punctuation-based Tokenization
punctuation_tokens = WordPunctTokenizer().tokenize(text)
print("Punctuation Tokenization:", punctuation_tokens)

Punctuation Tokenization: ['Hello', 'there', '!', 'This', 'is', 'the', 'first', 'assignment', 'of', 'NLP', '.']
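
For intuition, both tokenizers shown so far can be approximated with the standard library alone. The sketch below assumes that `WhitespaceTokenizer` behaves like `str.split()` and that `WordPunctTokenizer` behaves like the regular expression `\w+|[^\w\s]+`; the variable names `approx_whitespace` and `approx_punct` are illustrative and not part of the assignment.

```python
import re

text = "Hello there! This is the first assignment of NLP ."

# Whitespace tokenization: split on runs of whitespace
approx_whitespace = text.split()

# Punctuation-based tokenization: runs of word characters, or runs of
# characters that are neither word characters nor whitespace
approx_punct = re.findall(r"\w+|[^\w\s]+", text)

print("Approximate whitespace tokens:", approx_whitespace)
print("Approximate punctuation tokens:", approx_punct)
```

On this sentence the two lists should match the tokenizer outputs printed above; the only difference between them is that `there!` is split into `there` and `!`.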


**3. Treebank Tokenization: Uses Penn Treebank conventions to split text. It splits contractions (for example, "can't" becomes "ca" and "n't") and separates most punctuation from words.**

In [None]:
# Treebank Tokenization (TreebankWordTokenizer was imported in the first code cell)
treebank_tokens = TreebankWordTokenizer().tokenize(text)
print("Treebank Tokenization:", treebank_tokens)

Treebank Tokenization: ['Hello', 'there', '!', 'This', 'is', 'the', 'first', 'assignment', 'of', 'NLP', '.']
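
The assignment sentence contains no contractions, so the Treebank output looks identical to the punctuation-based one. As a small sketch of where they differ, the sentence in `sample` below is a hypothetical example chosen to contain a contraction; it is not part of the original assignment text.

```python
from nltk.tokenize import (
    WhitespaceTokenizer,
    WordPunctTokenizer,
    TreebankWordTokenizer,
)

sample = "I can't wait."  # hypothetical sentence with a contraction

ws_tokens = WhitespaceTokenizer().tokenize(sample)
wp_tokens = WordPunctTokenizer().tokenize(sample)
tb_tokens = TreebankWordTokenizer().tokenize(sample)

print("Whitespace: ", ws_tokens)   # contraction and final period stay attached
print("Punctuation:", wp_tokens)   # apostrophe becomes its own token
print("Treebank:   ", tb_tokens)   # contraction split as 'ca' + "n't"
```

None of these three tokenizers needs downloaded NLTK data, so the cell can run before `nltk.download` is called.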


**4. Tweet Tokenization: A tokenizer designed for social media text. It keeps hashtags, emoticons, and @-mentions as single tokens.**

In [None]:
# Tweet Tokenization
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(text)
print("Tweet Tokenization:", tweet_tokens)

Tweet Tokenization: ['Hello', 'there', '!', 'This', 'is', 'the', 'first', 'assignment', 'of', 'NLP', '.']
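
Because the assignment sentence has no hashtags or emoticons, the output above does not show what makes `TweetTokenizer` different. The sketch below uses a made-up tweet-like string (`tweet`, an assumption for illustration) and compares the result with `WordPunctTokenizer`.

```python
from nltk.tokenize import TweetTokenizer, WordPunctTokenizer

tweet = "This is a cooool #dummysmiley: :-) :-P <3"  # hypothetical input

social_tokens = TweetTokenizer().tokenize(tweet)
punct_tokens = WordPunctTokenizer().tokenize(tweet)

print("TweetTokenizer:    ", social_tokens)
print("WordPunctTokenizer:", punct_tokens)

# The hashtag should survive as one token only in the tweet-aware output
print("#dummysmiley" in social_tokens, "#dummysmiley" in punct_tokens)
```

`TweetTokenizer` also accepts options such as `strip_handles=True` and `reduce_len=True`, which can be useful when mentions or elongated words like "cooool" should be normalized.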


**5. Multi-Word Expression (MWE) Tokenization: Recognizes and keeps predefined multi-word expressions together.**





In [None]:
# nltk.word_tokenize needs the 'punkt_tab' resource in recent NLTK releases
nltk.download('punkt_tab')

# Multi-Word Expression Tokenization
# Note: neither expression below occurs in `text`, so no tokens are merged here
mwe_tokenizer = MWETokenizer([("learning", "NLP"), ("including", "tokenization")])
mwe_tokens = mwe_tokenizer.tokenize(nltk.word_tokenize(text))
print("MWE Tokenization:", mwe_tokens)

MWE Tokenization: ['Hello', 'there', '!', 'This', 'is', 'the', 'first', 'assignment', 'of', 'NLP', '.']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
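
The output above is unchanged because the registered expressions do not appear in the sentence. As a minimal sketch of the merging behaviour, the example below registers an expression that does occur in the text. The choice of `("first", "assignment")` and the names `base_tokens`, `mwe_demo`, and `merged` are assumptions made for illustration.

```python
from nltk.tokenize import MWETokenizer, WordPunctTokenizer

text = "Hello there! This is the first assignment of NLP ."

# Pre-tokenize with WordPunctTokenizer, which needs no downloaded data
base_tokens = WordPunctTokenizer().tokenize(text)

# Register a multi-word expression that actually appears in the text
mwe_demo = MWETokenizer([("first", "assignment")], separator="_")
merged = mwe_demo.tokenize(base_tokens)

print("MWE Tokenization:", merged)
```

`MWETokenizer` works on an already tokenized list, so the choice of the first-stage tokenizer determines which expressions can match.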


**Stemming: Stemming reduces words to a stem by stripping suffixes. The result is not guaranteed to be a real word (for example, "easily" becomes "easili"), but it collapses word variants to a common form.**

**1. Porter Stemmer: A simple and widely used stemming algorithm that removes common suffixes.**

In [None]:
from nltk.tokenize import word_tokenize

input_string = "running walking jumped easily cats"

# Tokenize the input string. The name whitespace_tokens is reused from the
# tokenization cells, although word_tokenize is used here.
whitespace_tokens = word_tokenize(input_string)

In [None]:
# 2. Stemming
print("\n--- Stemming ---")

porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")

porter_stemmed = [porter_stemmer.stem(word) for word in whitespace_tokens]
print("Porter Stemmer:", porter_stemmed)


--- Stemming ---
Porter Stemmer: ['run', 'walk', 'jump', 'easili', 'cat']


**2. Snowball Stemmer: A revised version of the Porter algorithm (the English variant is also known as Porter2). NLTK's SnowballStemmer also supports several languages other than English.**

In [None]:
snowball_stemmed = [snowball_stemmer.stem(word) for word in whitespace_tokens]
print("Snowball Stemmer:", snowball_stemmed)

Snowball Stemmer: ['run', 'walk', 'jump', 'easili', 'cat']
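
The two stemmers agree on every word in the input above. As a rough sketch of where they can diverge, the comparison below runs both on a few extra words; the word list is an assumption chosen for illustration and is not part of the assignment.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Hypothetical words picked to probe differences between the algorithms
probe_words = ["generously", "fairly", "running"]

for word in probe_words:
    print(f"{word:12} porter={porter.stem(word):10} snowball={snowball.stem(word)}")

# Languages supported by NLTK's Snowball implementation
print(SnowballStemmer.languages)
```

Neither stemmer needs downloaded data when used this way, so the comparison can be run on any word list of interest.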


**Lemmatization: Lemmatization is a more advanced technique than stemming. Instead of chopping off suffixes, it converts words into their dictionary (base) form, called the lemma.**

**WordNet Lemmatizer: Uses WordNet, a large lexical database, to find the base form of a word. The result depends on the part of speech passed to it, which defaults to noun; it does not infer the part of speech from context.**

In [None]:
# 3. Lemmatization
print("\n--- Lemmatization ---")

lemmatizer = WordNetLemmatizer()
# Treat every token as a verb; other parts of speech need a different pos argument
lemmatized = [lemmatizer.lemmatize(word, wordnet.VERB) for word in whitespace_tokens]
print("Lemmatized:", lemmatized)


--- Lemmatization ---
Lemmatized: ['run', 'walk', 'jump', 'easily', 'cat']
