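Lemmatization and stemming are natural language processing techniques that reduce words to a base or root form, so that variants such as "words" and "word" are treated as the same term. Stemming strips suffixes with fixed rules and can produce strings that are not real words; lemmatization looks each word up in a vocabulary and returns its dictionary form. Let's demonstrate how to perform both using NLTK.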

# Step 1: Install NLTK


In [1]:
!pip install nltk




# Step 2: Lemmatization


In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
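# Tokenizer models used by word_tokenize; newer NLTK releases may ask for
# 'punkt_tab' instead, so download that if word_tokenize raises a LookupError
nltk.download('punkt')
# Lexical database used by WordNetLemmatizer
nltk.download('wordnet')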

# Sample text for lemmatization
text = "Lemmatization is a technique that reduces words to their base form."

# Tokenize the text using NLTK
words = word_tokenize(text)

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

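# Perform lemmatization (with no pos argument, every token is treated as a noun,
# so verbs such as "is" and "reduces" are returned unchanged)
lemmas = [lemmatizer.lemmatize(word) for word in words]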

# Print original text and lemmas
print("Original Text:", text)
print("\nLemmas:")
print(lemmas)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shrim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shrim\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original Text: Lemmatization is a technique that reduces words to their base form.

Lemmas:
['Lemmatization', 'is', 'a', 'technique', 'that', 'reduces', 'word', 'to', 'their', 'base', 'form', '.']


# Step 3: Stemming


In [3]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Sample text for stemming
text = "Stemming is a technique that reduces words to their root form."

# Tokenize the text using NLTK
words = word_tokenize(text)

# Initialize the Porter Stemmer
porter_stemmer = PorterStemmer()

# Perform stemming
stems = [porter_stemmer.stem(word) for word in words]

# Print original text and stems
print("Original Text:", text)
print("\nStems:")
print(stems)


Original Text: Stemming is a technique that reduces words to their root form.

Stems:
['stem', 'is', 'a', 'techniqu', 'that', 'reduc', 'word', 'to', 'their', 'root', 'form', '.']
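
Stems such as `techniqu` and `reduc` are not dictionary words; what matters is that related forms collapse to the same string. A short sketch with a hand-picked word family (the list is illustrative and not taken from the text above):

```python
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

# Related inflected and derived forms chosen for illustration
word_family = ["connect", "connected", "connecting", "connection", "connections"]

family_stems = {word: porter_stemmer.stem(word) for word in word_family}
for word, stem in family_stems.items():
    print(f"{word:12} -> {stem}")

# All five forms share one stem, so a search index would treat them as one term
print(set(family_stems.values()))  # {'connect'}
```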


# Step 4: Comparing Stemmers

The Porter, Snowball, and Lancaster stemmers are three different stemming algorithms available in NLTK, each with its own characteristics. Here's a brief overview of the differences between them:

1. Porter Stemmer:
   - Developer: Martin Porter (published in 1980).
   - Characteristics:
     - One of the oldest and best-known stemming algorithms.
     - Applies a sequence of heuristic rules that strip common suffixes from English words.
     - Relatively simple and fast.
   - Example:
     - "running" → "run"
     - "happiness" → "happi"

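To make "heuristic rules that strip suffixes" concrete, here is a deliberately tiny suffix stripper. It is a toy for illustration only and is not the Porter algorithm: the rule list, the `toy_stem` name, and the minimum stem length of 3 are all assumptions made for this sketch.

```python
# Toy rules, tried in order: (suffix to remove, replacement). NOT the Porter rule set.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
MIN_STEM_LENGTH = 3  # assumed guard so short words like "is" are left alone


def toy_stem(word):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length] + replacement
    return word


for sample in ["caresses", "ponies", "jumping", "cats", "running", "is"]:
    print(f"{sample:10} -> {toy_stem(sample)}")
```

The toy returns `runn` for `running`, which hints at why a real stemmer needs additional rules (for example, undoing doubled consonants) beyond plain suffix removal.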

2. Snowball Stemmer (Porter2 or English Stemmer):
   - Developer: Also Martin Porter, as part of the Snowball stemmer project.
   - Characteristics:
     - A revision of the Porter stemmer with some improved rules.
     - Supports multiple languages, not just English.
     - Often described as slightly more aggressive than the Porter stemmer, though the two return the same stem for most English words.
   - Example:
     - "running" → "run"
     - "happiness" → "happi"


3. Lancaster Stemmer:
   - Developer: Chris D. Paice.
   - Characteristics:
     - The most aggressive of the three stemmers.
     - Uses a different rule set from the Porter and Snowball stemmers.
     - Tends to produce shorter stems, which may not be valid words.
   - Example:
     - "running" → "run"
     - "happiness" → "happy"

Differences:

- Aggressiveness:
  - Porter: moderate.
  - Snowball: often described as slightly more aggressive than Porter, though the two agree on most words.
  - Lancaster: the most aggressive.
- Rules:
  - Porter and Snowball share most of their rules and differ in specific details.
  - Lancaster uses its own, separate rule set.
- Typical use:
  - Porter and Snowball are the more widely used, since they balance merging related forms against keeping distinct words apart.
  - Lancaster's heavy truncation can merge unrelated words into one stem (over-stemming), which makes it a poor fit for some tasks.
- Stem length:
  - Lancaster tends to produce the shortest stems, and they are often not valid words.
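
The example words listed for each stemmer can be checked directly. The sketch below runs all three stemmers over the same small word list. `generously` is an extra word that does not appear in the text above; it is added because NLTK's documentation uses it to show Snowball's English stemmer returning `generous` where the original Porter rules return a shorter stem.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}
sample_words = ["running", "happiness", "generously"]

# Map each stemmer name to its stems, in the same order as sample_words
results = {
    name: [stemmer.stem(word) for word in sample_words]
    for name, stemmer in stemmers.items()
}

# Print one row per word and one column per stemmer
print(f"{'word':12}" + "".join(f"{name:12}" for name in stemmers))
for index, word in enumerate(sample_words):
    print(f"{word:12}" + "".join(f"{results[name][index]:12}" for name in stemmers))
```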

In [4]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

# Create stemmers for Porter, Snowball, and Lancaster algorithms
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")  # SnowballStemmer supports multiple languages, "english" is specified here
lancaster_stemmer = LancasterStemmer()

# Example sentence
sentence = "Stemming is a technique that reduces words to their base or root form."

# Tokenize the sentence
words = word_tokenize(sentence)

# Stem each word in the sentence using Porter, Snowball, and Lancaster stemmers
porter_stemmed_sentence = " ".join([porter_stemmer.stem(word) for word in words])
snowball_stemmed_sentence = " ".join([snowball_stemmer.stem(word) for word in words])
lancaster_stemmed_sentence = " ".join([lancaster_stemmer.stem(word) for word in words])

# Print the results
print("Original sentence:", sentence)
print("Porter Stemmed sentence:", porter_stemmed_sentence)
print("Snowball Stemmed sentence:", snowball_stemmed_sentence)
print("Lancaster Stemmed sentence:", lancaster_stemmed_sentence)


Original sentence: Stemming is a technique that reduces words to their base or root form.
Porter Stemmed sentence: stem is a techniqu that reduc word to their base or root form .
Snowball Stemmed sentence: stem is a techniqu that reduc word to their base or root form .
Lancaster Stemmed sentence: stem is a techn that reduc word to their bas or root form .
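
The `"english"` argument passed to `SnowballStemmer` above selects one of several supported languages. A small sketch of listing them and stemming a German word follows; the sample word comes from NLTK's own documentation rather than from the text above.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the SnowballStemmer constructor
print(SnowballStemmer.languages)

german_stemmer = SnowballStemmer("german")
german_stem = german_stemmer.stem("Autobahnen")
print(german_stem)  # autobahn
```

Porter and Lancaster in NLTK target English only, so Snowball is the one to reach for when the text is in another supported language.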
