###*60009220131 Sayantan Mukherjee D2-2*

In [2]:
!pip install nltk spacy
!python -m spacy download en_core_web_sm

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer
from nltk.stem import WordNetLemmatizer
import spacy

nltk.download('wordnet')
nltk.download('omw-1.4')
nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [3]:
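# Rule-based stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
# Custom stemmer: strips a trailing -ing, -s or -e; words shorter than 4 characters are left as-is
regex_stemmer = RegexpStemmer('ing$|s$|e$', min=4)
# Dictionary-based lemmatizer backed by WordNet
wordnet_lemmatizer = WordNetLemmatizer()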

In [5]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [7]:
text = "Hello guys, My name is Sayantan. This is my 3rd Experimental implemenation of NLP lab. This lab involves using stemming and lemmatization on word text."
words = nltk.word_tokenize(text)

###Porter Stemmer
Description : The Porter Stemmer is one of the oldest and most widely used stemming algorithms. It applies a series of heuristic rules to remove common suffixes (e.g., -ing, -ed, -s) from English words.

How it works : It runs five sequential phases of suffix reduction. Each phase holds a set of rules, and within a set the rule matching the longest suffix is the one applied, subject to conditions on the remaining stem.

Example :
Input: chasing

Output: chase

Differences :
Simple and fast but may over-stem or under-stem in some cases.
Less aggressive compared to Lancaster Stemmer.
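
Many of Porter's rules are conditioned on the *measure* of the stem, roughly the number of vowel-consonant sequences it contains. The following is a minimal standard-library sketch of that measure and of the plural rules from step 1a as described in Porter's original paper. It is not NLTK's implementation, the function names `is_consonant`, `measure` and `step_1a` are illustrative, and lowercase input is assumed.

```python
VOWELS = "aeiou"

def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Count vowel-run to consonant-run transitions, the m in [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m

def step_1a(word: str) -> str:
    """Plural rules of step 1a; the longest matching suffix is tried first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["tree", "trouble", "troubles"]:
    print(w, measure(w))
for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

Applied to "this", the last rule of `step_1a` yields "thi", which is consistent with the Porter output recorded in this notebook.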

In [8]:
porter_stems = [porter.stem(word) for word in words]
print("Porter Stemmer:", porter_stems)

Porter Stemmer: ['hello', 'guy', ',', 'my', 'name', 'is', 'sayantan', '.', 'thi', 'is', 'my', '3rd', 'experiment', 'implemen', 'of', 'nlp', 'lab', '.', 'thi', 'lab', 'involv', 'use', 'stem', 'and', 'lemmat', 'on', 'word', 'text', '.']


###Lancaster Stemmer
Description : The Lancaster Stemmer is more aggressive than the Porter Stemmer. It applies a larger set of rules and can produce shorter stems.

How it works : It iteratively applies rules from a rule table, re-checking the shortened word after each change, until no rule applies or a rule signals that stemming should stop.

Example :
Input: chasing

Output: chas

Differences :
More aggressive than Porter Stemmer, often producing shorter stems.
Can sometimes over-stem, leading to loss of meaningful distinctions between words.
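
The "apply rules until nothing changes" loop can be sketched in a few lines. This is a toy illustration only: `TOY_RULES` is a hypothetical four-entry table, not the real Lancaster rule set, and the `acceptable` guard is loosely modelled on the minimum-stem conditions of the Paice/Husk approach. The names `acceptable` and `toy_iterative_stem` are assumptions made for this sketch.

```python
# Hypothetical toy rule table: (suffix, replacement). NOT the real Lancaster rules.
TOY_RULES = [
    ("ation", "ate"),
    ("ate", ""),
    ("ing", ""),
    ("s", ""),
]

VOWELS = set("aeiouy")

def acceptable(stem: str) -> bool:
    """Reject stems that are too short to be useful."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2
    return len(stem) >= 3 and any(ch in VOWELS for ch in stem)

def toy_iterative_stem(word: str, rules=TOY_RULES) -> str:
    """Keep applying the first matching rule until no rule changes the word."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix, replacement in rules:
            if word.endswith(suffix):
                candidate = word[: len(word) - len(suffix)] + replacement
                if acceptable(candidate):
                    word = candidate
                    changed = True
                    break  # restart from the top of the rule table
    return word

for w in ["relations", "using", "sing"]:
    print(w, "->", toy_iterative_stem(w))
```

With this toy table, "relations" passes through "relation" and "relate" before ending as "rel", which shows how repeated rule application can strip a word well past a recognisable form.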

In [9]:
lancaster_stems = [lancaster.stem(word) for word in words]
print("Lancaster Stemmer:", lancaster_stems)

Lancaster Stemmer: ['hello', 'guy', ',', 'my', 'nam', 'is', 'say', '.', 'thi', 'is', 'my', '3rd', 'expery', 'implem', 'of', 'nlp', 'lab', '.', 'thi', 'lab', 'involv', 'us', 'stem', 'and', 'lem', 'on', 'word', 'text', '.']


###Snowball Stemmer
Description : The English Snowball stemmer, also known as the "Porter2" stemmer, is a revised version of the Porter Stemmer. The Snowball framework also provides stemmers for several other languages.

How it works : Similar to Porter Stemmer but with additional refinements and better handling of edge cases.

Example :
Input: chasing

Output: chase

Differences :
Generally more consistent than the original Porter Stemmer; in the outputs in this notebook it keeps "this" intact where Porter yields "thi".
Supports stemming for multiple languages, not just English.

In [10]:
snowball_stems = [snowball.stem(word) for word in words]
print("Snowball Stemmer:", snowball_stems)

Snowball Stemmer: ['hello', 'guy', ',', 'my', 'name', 'is', 'sayantan', '.', 'this', 'is', 'my', '3rd', 'experiment', 'implemen', 'of', 'nlp', 'lab', '.', 'this', 'lab', 'involv', 'use', 'stem', 'and', 'lemmat', 'on', 'word', 'text', '.']


###Regex Stemmer
Description : The Regex Stemmer allows you to define custom regular expressions for stemming. It removes suffixes based on user-defined patterns.

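How it works : You specify a regex pattern (e.g., 'ing$|s$|e$') and every match is deleted from the word. Words shorter than the `min` length (4 here) are returned unchanged, and the stemmer does not lowercase its input.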

Example :
Input: chasing

Output: chas (the pattern strips -ing and nothing is appended)

Differences :
Fully customizable but requires manual definition of rules.
Useful for domain-specific stemming but less general-purpose than other stemmers.
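
The behaviour described above can be approximated with the standard `re` module. This is a sketch of the idea rather than NLTK's source; the function name `regex_stem` and its parameter names are illustrative, and the defaults mirror the pattern and minimum length used in this notebook.

```python
import re

def regex_stem(word: str, pattern: str = r"ing$|s$|e$", min_len: int = 4) -> str:
    """Delete whatever the pattern matches, unless the word is too short."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

for w in ["chasing", "guys", "name", "This", "is", "Hello"]:
    print(w, "->", regex_stem(w))
```

Because the pattern only deletes text, "chasing" becomes "chas" and "name" becomes "nam"; recovering "chase" would need an extra rule that appends a letter.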

In [11]:
regex_stems = [regex_stemmer.stem(word) for word in words]
print("Regex Stemmer:", regex_stems)

Regex Stemmer: ['Hello', 'guy', ',', 'My', 'nam', 'is', 'Sayantan', '.', 'Thi', 'is', 'my', '3rd', 'Experimental', 'implemenation', 'of', 'NLP', 'lab', '.', 'Thi', 'lab', 'involve', 'us', 'stemm', 'and', 'lemmatization', 'on', 'word', 'text', '.']


###WordNet Lemmatizer
Description : The WordNet Lemmatizer reduces words to their base or dictionary form (lemma) using the WordNet lexical database.

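How it works : It looks up the word in WordNet and returns its lemma for a given part-of-speech (POS). `lemmatize()` defaults to `pos='n'` (noun), so verbs such as "involves" and "using" are returned unchanged unless a verb tag is passed.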

Example :

Input: chasing

Output: chase (if POS is verb)

Differences :
Produces valid dictionary words, unlike stemmers which may produce non-words.
Requires POS tagging for optimal results, making it slower than stemmers.
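
Since the lemmatizer expects WordNet's own POS letters, a tagger's output usually has to be converted first. Below is a small sketch of that conversion, assuming Penn Treebank style tags such as those produced by common English taggers. The helper name `penn_to_wordnet` is hypothetical and not part of NLTK.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to one of WordNet's POS letters."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

for tag in ["VBG", "NNS", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can be passed as the `pos` argument, for example `wordnet_lemmatizer.lemmatize("chasing", pos="v")`. Obtaining the tags themselves requires a POS tagger, which typically needs an additional resource download.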

In [12]:
wordnet_lemmas = [wordnet_lemmatizer.lemmatize(word) for word in words]
print("WordNet Lemmatizer:", wordnet_lemmas)

WordNet Lemmatizer: ['Hello', 'guy', ',', 'My', 'name', 'is', 'Sayantan', '.', 'This', 'is', 'my', '3rd', 'Experimental', 'implemenation', 'of', 'NLP', 'lab', '.', 'This', 'lab', 'involves', 'using', 'stemming', 'and', 'lemmatization', 'on', 'word', 'text', '.']


###spaCy Lemmatizer
Description : spaCy's lemmatizer is a component of its NLP pipeline. For English models such as en_core_web_sm it is rule-based and relies on the part-of-speech tags predicted by the pipeline's trained tagger, so the lemma can depend on context.

How it works : It processes the entire text and assigns each token a lemma using the token's predicted POS tag together with lookup and rule tables.

Example :

Input: chasing

Output: chase

Differences :
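Context-aware: the surrounding words influence the predicted POS tag and therefore the lemma (e.g., "is" becomes "be" in the output below).
Often more accurate than the WordNet Lemmatizer used without POS tags, but computationally heavier because the full pipeline runs.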

In [13]:
spacy_lemmas = [token.lemma_ for token in nlp(text)]
print("Spacy Lemmatizer:", spacy_lemmas)

Spacy Lemmatizer: ['hello', 'guy', ',', 'my', 'name', 'be', 'Sayantan', '.', 'this', 'be', 'my', '3rd', 'experimental', 'implemenation', 'of', 'NLP', 'lab', '.', 'this', 'lab', 'involve', 'use', 'stem', 'and', 'lemmatization', 'on', 'word', 'text', '.']
