# Lab Assignment 1: NLP Techniques

This notebook demonstrates Tokenization, Stemming, and Lemmatization using NLTK.

## Import Libraries

In [1]:

import nltk
nltk.download('punkt')
nltk.download('wordnet')

from nltk.tokenize import WhitespaceTokenizer, wordpunct_tokenize, TreebankWordTokenizer, TweetTokenizer, MWETokenizer
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\akash\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\akash\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Input Text

In [2]:

text = "Don't worry! Loving #NLP @OpenAI 😊 natural language processing"
print(text)


Don't worry! Loving #NLP @OpenAI 😊 natural language processing


## Tokenization

### 1) Whitespace Tokenizer

In [3]:

wt = WhitespaceTokenizer()
print(wt.tokenize(text))


["Don't", 'worry!', 'Loving', '#NLP', '@OpenAI', '😊', 'natural', 'language', 'processing']


### 2) Punctuation Tokenizer

In [4]:

print(wordpunct_tokenize(text))


['Don', "'", 't', 'worry', '!', 'Loving', '#', 'NLP', '@', 'OpenAI', '😊', 'natural', 'language', 'processing']
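
The split at the apostrophe follows from the pattern behind `wordpunct_tokenize`: NLTK documents it as a regular-expression tokenizer using `\w+|[^\w\s]+`, that is, runs of word characters or runs of characters that are neither word characters nor whitespace. A minimal sketch with the standard `re` module, assuming that pattern (the helper name `regex_wordpunct` is hypothetical and not part of NLTK):

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: a run of word
# characters, or a run of characters that are neither word nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def regex_wordpunct(text):
    """Hypothetical stand-in for wordpunct_tokenize using only `re`."""
    return WORDPUNCT_PATTERN.findall(text)


print(regex_wordpunct("Don't worry! Loving #NLP @OpenAI"))
# ['Don', "'", 't', 'worry', '!', 'Loving', '#', 'NLP', '@', 'OpenAI']
```

Because the apostrophe is not a word character, `Don't` always breaks into three tokens here, which is the behaviour the Treebank and Tweet tokenizers are designed to avoid.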


### 3) Treebank Tokenizer

In [5]:

twt = TreebankWordTokenizer()
print(twt.tokenize(text))


['Do', "n't", 'worry', '!', 'Loving', '#', 'NLP', '@', 'OpenAI', '😊', 'natural', 'language', 'processing']


### 4) Tweet Tokenizer

In [6]:

tweet = TweetTokenizer()
print(tweet.tokenize(text))


["Don't", 'worry', '!', 'Loving', '#NLP', '@OpenAI', '😊', 'natural', 'language', 'processing']


### 5) MWE Tokenizer

In [7]:

mwe = MWETokenizer([('natural', 'language')], separator='_')
print(mwe.tokenize(text.split()))


["Don't", 'worry!', 'Loving', '#NLP', '@OpenAI', '😊', 'natural_language', 'processing']
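
`MWETokenizer` does not split raw text. It takes a list that is already tokenized and merges the token sequences it was given, which is why `text.split()` is passed in and why `'worry!'` keeps its punctuation. The idea can be sketched in plain Python as a greedy left-to-right merge. This is a simplified illustration, not NLTK's implementation, and `merge_mwes` is a hypothetical helper:

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Greedily merge known multi-word expressions in a token list.

    Simplified illustration only; `mwes` is a list of token tuples.
    """
    # Ignore empty expressions and try longer ones first so the longest wins.
    ordered = sorted((m for m in mwes if m), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == tuple(mwe):
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position; keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "Loving natural language processing".split()
print(merge_mwes(tokens, [("natural", "language")]))
# ['Loving', 'natural_language', 'processing']
```

Matching is exact, token by token, so an expression is only merged when the upstream tokenizer produced precisely those tokens.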


## Stemming

In [8]:

words = ['running', 'studies', 'connection', 'better']
print(words)


['running', 'studies', 'connection', 'better']


### 1) Porter Stemmer

In [9]:

ps = PorterStemmer()
for w in words:
    print(w, "→", ps.stem(w))


running → run
studies → studi
connection → connect
better → better
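
The stem `studi` is not a typo: a stemmer applies suffix-rewriting rules and makes no attempt to return a dictionary word. As an illustration, step 1a of Porter's published algorithm rewrites `sses` to `ss`, rewrites `ies` to `i`, leaves `ss` unchanged, and drops a final `s`. The sketch below implements only that one step. The function name `porter_step_1a` is hypothetical; NLTK's `PorterStemmer` applies several further steps and, by default, some extensions of its own.

```python
def porter_step_1a(word):
    """Apply only step 1a of the original Porter algorithm (plural endings)."""
    # Rules are tried in order; the first matching suffix wins.
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # studies -> studi
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word               # no plural ending; left for later steps


for w in ["caresses", "studies", "caress", "cats", "running"]:
    print(w, "->", porter_step_1a(w))
```

`running` passes through this step untouched; the `-ing` ending is handled by a later step of the algorithm, which this sketch omits.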


### 2) Snowball Stemmer

In [10]:

ss = SnowballStemmer("english")
for w in words:
    print(w, "→", ss.stem(w))


running → run
studies → studi
connection → connect
better → better


## Lemmatization (WordNet)

In [11]:

lemmatizer = WordNetLemmatizer()
for w in words:
    print(w, "→", lemmatizer.lemmatize(w))


running → running
studies → study
connection → connection
better → better


## Conclusion
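Tokenization splits text into tokens, and the choice of tokenizer determines how contractions, punctuation, hashtags, and mentions are handled. Stemming strips suffixes by rule, so the result is not always a real word (`studies` becomes `studi`). Lemmatization maps a word to its dictionary form (`studies` becomes `study`). Because `lemmatize` was called without a part-of-speech argument, it treated every word as a noun, which is why `running` and `better` came back unchanged.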