# 1. Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using the NLTK library.
# Use the Porter stemmer and the Snowball stemmer for stemming. Use any technique for lemmatization.

Tokenization: Tokenization is the process of breaking text into smaller units called tokens, typically words or subwords. Depending on the purpose of the analysis, text can be split on whitespace, on punctuation, or by more specific rules.

Whitespace Tokenization: This method involves splitting text based on spaces, tabs, or line breaks. Each word or group of characters separated by whitespace becomes a token.
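
A minimal sketch of whitespace-only splitting, assuming NLTK's `WhitespaceTokenizer` class and reusing the notebook's sample sentence. Punctuation stays attached to the neighbouring word because only spaces, tabs, and newlines act as separators.

```python
from nltk.tokenize import WhitespaceTokenizer

text = "NLTK is a powerful library for natural language processing tasks. It's #awesome!"

# Split on runs of spaces, tabs, and newlines only; punctuation stays attached.
tokens_whitespace = WhitespaceTokenizer().tokenize(text)
print(tokens_whitespace)

# On this input the result matches Python's built-in str.split().
print(tokens_whitespace == text.split())
```

Tokens such as `tasks.` and `#awesome!` keep their punctuation, which is the main difference from the other tokenizers described here.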

Punctuation-based Tokenization: In this approach, punctuation marks such as commas, periods, and exclamation marks act as token boundaries in addition to whitespace. The punctuation marks themselves can either be kept as separate tokens or discarded. NLTK's wordpunct_tokenize keeps them, which is why "It's" comes out as three tokens: It, ', and s.
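
The two options (keeping or discarding punctuation) can be sketched as follows. The sample string is made up for illustration, and using `RegexpTokenizer` with the pattern `\w+` is just one possible way to drop punctuation.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sample = "It's #awesome, isn't it?!"  # illustrative input, not from the notebook

# Keep punctuation: alphanumeric runs and punctuation runs become separate tokens.
with_punct = wordpunct_tokenize(sample)
print(with_punct)

# Drop punctuation: keep only the alphanumeric runs.
without_punct = RegexpTokenizer(r"\w+").tokenize(sample)
print(without_punct)
```

Consecutive punctuation marks such as `?!` stay together as one token in the first variant, and contractions are split at the apostrophe in both.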

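Treebank Tokenization: Treebank tokenization follows the tokenization conventions of the Penn Treebank corpus, which are widely used in natural language processing (NLP). NLTK's TreebankWordTokenizer implements them with regular expressions: it splits standard contractions (for example, "It's" becomes "It" and "'s"), separates most punctuation from words, and splits off a period only at the very end of the input. It therefore expects one sentence at a time, which is why "tasks." keeps its period when the whole two-sentence text is passed in at once.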
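
A short sketch of the Treebank conventions, assuming `TreebankWordTokenizer` from `nltk.tokenize`. The first two inputs are illustrative; the last part splits the notebook's sample text into sentences by hand (an assumption made here to avoid extra downloads) to show the effect of tokenizing one sentence at a time.

```python
from nltk.tokenize import TreebankWordTokenizer

treebank = TreebankWordTokenizer()

# Contractions are split into Penn Treebank style pieces.
contraction_tokens = treebank.tokenize("They'll save and invest more.")
print(contraction_tokens)

negation_tokens = treebank.tokenize("hi, my name can't hello,")
print(negation_tokens)

# A period is split off only at the end of the input, so feed one sentence at a time.
sentences = [
    "NLTK is a powerful library for natural language processing tasks.",
    "It's #awesome!",
]
per_sentence_tokens = [treebank.tokenize(sentence) for sentence in sentences]
print(per_sentence_tokens)
```

In practice a sentence splitter such as `nltk.sent_tokenize` would usually produce the sentence list.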

Tweet Tokenization: This is a specialized form of tokenization designed for tweets and similar social media text. It keeps hashtags, mentions, emojis and emoticons, and other platform-specific elements together as single tokens instead of splitting them at punctuation.
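
The sketch below uses NLTK's `TweetTokenizer` on two made-up tweets. The `strip_handles` and `reduce_len` options shown in the second call are optional constructor flags; whether they are wanted depends on the task.

```python
from nltk.tokenize import TweetTokenizer

# Default settings: hashtags and emoticons survive as single tokens.
default_tokenizer = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
default_tokens = default_tokenizer.tokenize(tweet)
print(default_tokens)

# strip_handles removes @mentions; reduce_len shortens long character repeats.
cleaning_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
reply = "@remy: This is waaaaayyyy too much for you!!!!!!"
cleaned_tokens = cleaning_tokenizer.tokenize(reply)
print(cleaned_tokens)
```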

MWE (Multi-Word Expression) Tokenization: Multi-word expressions are sequences of words that function as a single unit in language. Examples include "kick the bucket" and "break the ice." An MWE tokenizer merges such sequences into single tokens so that their meaning is preserved.
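
A minimal sketch of MWE tokenization, assuming NLTK's `MWETokenizer`. It works on text that is already split into tokens, and the list of expressions to merge is supplied by the user; the expressions and the first sentence below are chosen only for illustration.

```python
from nltk.tokenize import MWETokenizer

# Register the expressions to merge; matching is exact and case-sensitive.
mwe_tokenizer = MWETokenizer(
    [("kick", "the", "bucket"), ("natural", "language", "processing")],
    separator="_",
)

idiom_tokens = mwe_tokenizer.tokenize("He might kick the bucket soon".split())
print(idiom_tokens)

text = "NLTK is a powerful library for natural language processing tasks. It's #awesome!"
sample_tokens = mwe_tokenizer.tokenize(text.split())
print(sample_tokens)
```

Further expressions can be registered later with `add_mwe`, and any tokenizer can supply the input list in place of `str.split()`.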

Stemming: Stemming is the process of reducing words to their base or root form, often by removing suffixes or prefixes. The goal is to map related words to the same stem, thereby reducing variation in the vocabulary. Stemming algorithms may not always produce valid words.

Stemmer: A stemmer is an algorithm or program that performs stemming. It applies predefined rules to strip affixes from words to obtain their stems.

Porter Stemmer: One of the best-known stemming algorithms, published by Martin Porter in 1980. It applies a fixed sequence of rules that strip common suffixes from English words. It is simple and efficient, but its stems are often not actual words (for example, "library" becomes "librari").

Snowball Stemmer: Snowball is a framework, also created by Martin Porter, for writing stemming algorithms for many languages. Its English stemmer, also known as Porter2, revises the original Porter rules and is generally considered somewhat more accurate, although on short inputs such as the sample text in this notebook the two stemmers produce identical stems.
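
To make the difference between the two stemmers visible, the sketch below runs both on a few words. The word list is chosen for illustration; "generously" and "fairly" are cases where the two rule sets diverge, while "powerful" and "library" come from the sample text.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the English Snowball stemmer (Porter2)

words = ["generously", "fairly", "powerful", "library"]
for word in words:
    # Print both stems side by side for comparison.
    print(f"{word:12} porter={porter.stem(word):10} snowball={snowball.stem(word)}")

# Snowball also ships stemmers for other languages.
print("english" in SnowballStemmer.languages)
```

Snowball tends to stop at a more readable stem for adverbs ending in "-ly", whereas the original Porter rules either keep an "-li" ending or strip further.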

Lemmatization: Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. Unlike stemming, it aims to produce valid words, typically through dictionary lookup and morphological analysis, and it can take a word's part of speech into account. NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed.

In [1]:
import nltk
from nltk.tokenize import word_tokenize, wordpunct_tokenize, TreebankWordTokenizer, TweetTokenizer
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
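# word_tokenize needs the Punkt sentence model; newer NLTK releases look for
# the 'punkt_tab' resource instead of 'punkt'.
nltk.download('punkt')
# WordNetLemmatizer needs the WordNet corpus.
nltk.download('wordnet')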

# Sample text
text = "NLTK is a powerful library for natural language processing tasks. It's #awesome!"

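# Tokenization
# Note: word_tokenize is NLTK's general-purpose word tokenizer, not a plain
# whitespace split, so punctuation and contractions come out as separate tokens.
# nltk.tokenize.WhitespaceTokenizer splits on whitespace only.
print("\n**Whitespace Tokenization:**")
tokens_ws = word_tokenize(text)
print(tokens_ws)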

print("\n**Punctuation-based Tokenization:**")
tokens_punct = wordpunct_tokenize(text)
print(tokens_punct)

print("\n**Treebank Tokenization:**")
treebank_tokenizer = TreebankWordTokenizer()
tokens_treebank = treebank_tokenizer.tokenize(text)
print(tokens_treebank)

print("\n**Tweet Tokenization:**")
tweet_tokenizer = TweetTokenizer()
tokens_tweet = tweet_tokenizer.tokenize(text)
print(tokens_tweet)

# Stemming
print("\n**Porter Stemming:**")
porter_stemmer = PorterStemmer()
stems_porter = [porter_stemmer.stem(token) for token in tokens_ws]
print(stems_porter)

print("\n**Snowball Stemming:**")
snowball_stemmer = SnowballStemmer("english")  # Choose a language
stems_snowball = [snowball_stemmer.stem(token) for token in tokens_ws]
print(stems_snowball)

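# Lemmatization
# lemmatize() treats each token as a noun unless a pos argument is given,
# so verb forms such as 'is' are returned unchanged.
print("\n**Lemmatization:**")
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens_ws]
print(lemmas)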


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adwai\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\adwai\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



**Whitespace Tokenization:**
['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', 'tasks', '.', 'It', "'s", '#', 'awesome', '!']

**Punctuation-based Tokenization:**
['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', 'tasks', '.', 'It', "'", 's', '#', 'awesome', '!']

**Treebank Tokenization:**
['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', 'tasks.', 'It', "'s", '#', 'awesome', '!']

**Tweet Tokenization:**
['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', 'tasks', '.', "It's", '#awesome', '!']

**Porter Stemming:**
['nltk', 'is', 'a', 'power', 'librari', 'for', 'natur', 'languag', 'process', 'task', '.', 'it', "'s", '#', 'awesom', '!']

**Snowball Stemming:**
['nltk', 'is', 'a', 'power', 'librari', 'for', 'natur', 'languag', 'process', 'task', '.', 'it', "'s", '#', 'awesom', '!']

**Lemmatization:**
['NLTK', 'is', 'a', 'powerful', 'library