## Stemming and Lemmatization
Stemming and lemmatization are both techniques used in natural language processing to reduce words to their base forms, but they differ in their approaches and accuracy.

## Stemming:

- Reduces words to a base form by applying simple suffix-stripping rules, without consulting a dictionary.
- May produce non-words (e.g., "studies" -> "studi").
- Fast, which suits speed-critical applications such as real-time search engines.
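
To make "simple rules" concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list and the `toy_stem` helper are hypothetical and written only for illustration; this is not the Porter algorithm or any NLTK stemmer.

```python
# Toy suffix-stripping stemmer: a hypothetical rule list, NOT the Porter algorithm.
# Rules are tried in order, so "ies" is checked before the plain "s" rule.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("ly", ""), ("s", "")]


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; leave short words alone."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least 3 characters to remain so "is" or "bed" stay intact
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: len(lowered) - len(suffix)] + replacement
    return lowered


for word in ["studies", "running", "jumped", "quickly", "cats", "is"]:
    print(f"{word} -> {toy_stem(word)}")
```

Note that "running" becomes "runn" and "studies" becomes "studi": the rules know nothing about the vocabulary, which is why stemmers can return strings that are not words.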

## Lemmatization:

- Reduces words to their dictionary forms (lemmas), taking context and part of speech into account.
- Produces valid words (e.g., "better" -> "good" when "better" is treated as an adjective).
- More accurate but slower, which suits accuracy-critical applications such as machine translation and text classification.
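
The role of part of speech can be sketched with a dictionary lookup. The `TOY_LEXICON` table and the `toy_lemmatize` helper below are hypothetical stand-ins; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults a full lexical database instead of a hand-written table.

```python
# Hypothetical miniature lexicon keyed by (word, part of speech).
# A real lemmatizer looks words up in a full dictionary such as WordNet.
TOY_LEXICON = {
    ("better", "adj"): "good",
    ("better", "verb"): "better",
    ("running", "verb"): "run",
    ("running", "noun"): "running",
    ("mice", "noun"): "mouse",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the word unchanged."""
    key = (word.lower(), pos)
    return TOY_LEXICON.get(key, word.lower())


# The same surface form maps to different lemmas depending on its part of speech
print(toy_lemmatize("better", "adj"))
print(toy_lemmatize("better", "verb"))
print(toy_lemmatize("running", "verb"))
print(toy_lemmatize("running", "noun"))
```

Every value in the table is a real word, which is the property that separates lemmatization from stemming.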

## Key Difference

Stemming: Fast, less accurate, produces non-words.

Lemmatization: Slower, more accurate, produces valid words.


In [None]:
import nltk
nltk.download('punkt')  # Tokenizer models for sentence and word tokenization
nltk.download('punkt_tab')  # Newer NLTK releases (3.9+) load tokenizer data from this package instead
nltk.download('averaged_perceptron_tagger')  # For part-of-speech tagging
nltk.download('wordnet')  # For lemmatization
nltk.download('stopwords')  # For stopwords


## PorterStemmer:  
PorterStemmer implements Martin Porter's 1980 algorithm, a widely used stemmer for English. It strips suffixes in a fixed sequence of rule-based steps so that related words end up sharing a stem. It is computationally efficient, but the stems it produces are not always valid words or linguistically accurate.
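
As an illustration of what one of those rule-based steps looks like, the sketch below reproduces only Step 1a of the original Porter algorithm, which handles plural endings. The function name `porter_step_1a` is made up for this example; NLTK's `PorterStemmer` applies many further steps (and some extensions of its own), so its output can differ from this single step.

```python
def porter_step_1a(word: str) -> str:
    """Step 1a of Porter's 1980 algorithm: plural suffix handling only."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]  # IES  -> I    (ponies   -> poni)
    if word.endswith("ss"):
        return word       # SS   -> SS   (caress   -> caress)
    if word.endswith("s"):
        return word[:-1]  # S    -> ""   (cats     -> cat)
    return word


for word in ["caresses", "ponies", "caress", "cats"]:
    print(f"{word} -> {porter_step_1a(word)}")
```

The order of the checks matters: the longest matching suffix is tested first, so "caresses" is handled by the SSES rule and never reaches the plain S rule.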

In [None]:
# Importing the Natural Language Toolkit (nltk) library for text preprocessing
import nltk

# Importing the PorterStemmer class from the nltk.stem module
# PorterStemmer is used to reduce words to their root form, which helps in text normalization
from nltk.stem import PorterStemmer

# Importing the stopwords from the nltk.corpus module
# Stopwords are common words like "the", "is", "in" that are usually removed in text processing
from nltk.corpus import stopwords

# Downloading the 'punkt' and 'stopwords' packages from nltk
# 'punkt' is a pre-trained model for tokenizing text into sentences and words
# 'stopwords' contains a list of common stopwords for various languages
nltk.download('punkt')
nltk.download('stopwords')

# Defining a paragraph of text about Artificial Intelligence (AI)
# This paragraph will be used to demonstrate text preprocessing techniques
paragraph = """Artificial Intelligence (AI) is a transformative technology that mimics human intelligence
to perform tasks such as learning, reasoning, problem-solving, and decision-making. It encompasses various subfields
including machine learning, natural language processing, computer vision, and robotics. AI systems analyze
vast amounts of data to identify patterns, make predictions, and improve their performance over time through iterative
processes. This technology has vast applications across industries, from healthcare, where it aids in diagnosing
diseases and personalizing treatment plans, to finance, where it enhances fraud detection and automates trading.
AI also powers virtual assistants like Siri and Alexa, self-driving cars, and advanced manufacturing processes.
As AI continues to evolve, it promises to revolutionize the way we live and work, offering unprecedented opportunities
for innovation and efficiency while also posing ethical and societal challenges that must be carefully managed."""

# Tokenizing the paragraph into sentences
# This breaks down the paragraph into individual sentences for further processing
sentences = nltk.sent_tokenize(paragraph)

# Creating an instance of the PorterStemmer
# This instance will be used to stem words, reducing them to their base form
stemmer = PorterStemmer()

# Getting the list of English stopwords
stop_words = set(stopwords.words('english'))

# Iterating over each sentence in the paragraph
for i in range(len(sentences)):
    # Tokenizing each sentence into words
    words = nltk.word_tokenize(sentences[i])

    # Removing stopwords and stemming the remaining words
    stemmed_words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]

    # Joining the stemmed words back into a sentence
    sentences[i] = ' '.join(stemmed_words)

# Printing the processed sentences
for sentence in sentences:
    print(sentence)


## LancasterStemmer:

LancasterStemmer implements the Paice/Husk algorithm and, like PorterStemmer, reduces words to a base form. It is more aggressive, however, and often produces shorter stems that are less intuitive and harder to read than PorterStemmer's. It is known for fast execution and aggressive stemming rules.

In [None]:
# Importing the Natural Language Toolkit (nltk) library for text preprocessing
import nltk

# Importing the LancasterStemmer class from the nltk.stem module
# LancasterStemmer is used to reduce words to their root form, which helps in text normalization
from nltk.stem import LancasterStemmer

# Importing the stopwords from the nltk.corpus module
# Stopwords are common words like "the", "is", "in" that are usually removed in text processing
from nltk.corpus import stopwords

# Downloading the 'punkt' and 'stopwords' packages from nltk
# 'punkt' is a pre-trained model for tokenizing text into sentences and words
# 'stopwords' contains a list of common stopwords for various languages
nltk.download('punkt')
nltk.download('stopwords')

# Defining a paragraph of text about Artificial Intelligence (AI)
# This paragraph will be used to demonstrate text preprocessing techniques
paragraph = """Artificial Intelligence (AI) is a transformative technology that mimics human intelligence
to perform tasks such as learning, reasoning, problem-solving, and decision-making. It encompasses various subfields
including machine learning, natural language processing, computer vision, and robotics. AI systems analyze
vast amounts of data to identify patterns, make predictions, and improve their performance over time through iterative
processes. This technology has vast applications across industries, from healthcare, where it aids in diagnosing
diseases and personalizing treatment plans, to finance, where it enhances fraud detection and automates trading.
AI also powers virtual assistants like Siri and Alexa, self-driving cars, and advanced manufacturing processes.
As AI continues to evolve, it promises to revolutionize the way we live and work, offering unprecedented opportunities
for innovation and efficiency while also posing ethical and societal challenges that must be carefully managed."""

# Tokenizing the paragraph into sentences
# This breaks down the paragraph into individual sentences for further processing
sentences = nltk.sent_tokenize(paragraph)

# Creating an instance of the LancasterStemmer
# This instance will be used to stem words, reducing them to their base form
stemmer = LancasterStemmer()

# Getting the list of English stopwords
stop_words = set(stopwords.words('english'))

# Iterating over each sentence in the paragraph
for i in range(len(sentences)):
    # Tokenizing each sentence into words
    words = nltk.word_tokenize(sentences[i])

    # Removing stopwords and stemming the remaining words
    stemmed_words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]

    # Joining the stemmed words back into a sentence
    sentences[i] = ' '.join(stemmed_words)

# Printing the processed sentences
for sentence in sentences:
    print(sentence)


## RegexpStemmer:

RegexpStemmer is a stemming algorithm provided by NLTK that allows customization using regular expressions. It enables specific rules for stemming, making it suitable for tasks where tailored patterns and transformations are required.
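
The idea can be sketched with the standard `re` module alone. The `regexp_stem` helper below is an illustrative stand-in that assumes the behavior described above (delete whatever the pattern matches, and skip words shorter than a minimum length); it is not NLTK's own code.

```python
import re

# Same suffix pattern as the RegexpStemmer example in this notebook
SUFFIX_PATTERN = re.compile(r"ing$|s$|ed$|er$|ly$")
MIN_LENGTH = 4  # words shorter than this are returned unchanged


def regexp_stem(word: str) -> str:
    """Delete a matching suffix, leaving very short words untouched."""
    if len(word) < MIN_LENGTH:
        return word
    return SUFFIX_PATTERN.sub("", word)


for word in ["learning", "cars", "was", "processes", "computer", "carefully"]:
    print(f"{word} -> {regexp_stem(word)}")
```

Because the pattern is purely textual, it trims "computer" to "comput" and "processes" to "processe". The approach is only as good as the expression supplied to it.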

In [None]:
# Importing the Natural Language Toolkit (nltk) library for text preprocessing
import nltk

# Importing the RegexpStemmer class from the nltk.stem module
# RegexpStemmer is used to reduce words to their root form based on regular expressions
from nltk.stem import RegexpStemmer

# Importing the stopwords from the nltk.corpus module
# Stopwords are common words like "the", "is", "in" that are usually removed in text processing
from nltk.corpus import stopwords

# Downloading the 'punkt' and 'stopwords' packages from nltk
# 'punkt' is a pre-trained model for tokenizing text into sentences and words
# 'stopwords' contains a list of common stopwords for various languages
nltk.download('punkt')
nltk.download('stopwords')

# Defining a paragraph of text about Artificial Intelligence (AI)
# This paragraph will be used to demonstrate text preprocessing techniques
paragraph = """Artificial Intelligence (AI) is a transformative technology that mimics human intelligence
to perform tasks such as learning, reasoning, problem-solving, and decision-making. It encompasses various subfields
including machine learning, natural language processing, computer vision, and robotics. AI systems analyze
vast amounts of data to identify patterns, make predictions, and improve their performance over time through iterative
processes. This technology has vast applications across industries, from healthcare, where it aids in diagnosing
diseases and personalizing treatment plans, to finance, where it enhances fraud detection and automates trading.
AI also powers virtual assistants like Siri and Alexa, self-driving cars, and advanced manufacturing processes.
As AI continues to evolve, it promises to revolutionize the way we live and work, offering unprecedented opportunities
for innovation and efficiency while also posing ethical and societal challenges that must be carefully managed."""

# Tokenizing the paragraph into sentences
# This breaks down the paragraph into individual sentences for further processing
sentences = nltk.sent_tokenize(paragraph)

# Creating an instance of the RegexpStemmer
# This instance stems words by deleting whatever the regular expression matches
# The expression removes common suffixes; words shorter than `min` characters are left unchanged
stemmer = RegexpStemmer('ing$|s$|ed$|er$|ly$', min=4)

# Getting the list of English stopwords
stop_words = set(stopwords.words('english'))

# Iterating over each sentence in the paragraph
for i in range(len(sentences)):
    # Tokenizing each sentence into words
    words = nltk.word_tokenize(sentences[i])

    # Removing stopwords and stemming the remaining words
    stemmed_words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]

    # Joining the stemmed words back into a sentence
    sentences[i] = ' '.join(stemmed_words)

# Printing the processed sentences
for sentence in sentences:
    print(sentence)


## SnowballStemmer:
SnowballStemmer wraps the stemmers that Martin Porter wrote in his Snowball language. Its English stemmer, also known as Porter2, revises the original Porter algorithm and often produces more consistent stems than PorterStemmer. Unlike PorterStemmer, it also supports stemming in multiple languages.

In [None]:
# Importing the Natural Language Toolkit (nltk) library for text preprocessing
import nltk

# Importing the SnowballStemmer class from the nltk.stem module
# SnowballStemmer is used to reduce words to their root form
from nltk.stem import SnowballStemmer

# Importing the stopwords from the nltk.corpus module
# Stopwords are common words like "the", "is", "in" that are usually removed in text processing
from nltk.corpus import stopwords

# Downloading the 'punkt' and 'stopwords' packages from nltk
# 'punkt' is a pre-trained model for tokenizing text into sentences and words
# 'stopwords' contains a list of common stopwords for various languages
nltk.download('punkt')
nltk.download('stopwords')

# Defining a paragraph of text about Artificial Intelligence (AI)
# This paragraph will be used to demonstrate text preprocessing techniques
paragraph = """Artificial Intelligence (AI) is a transformative technology that mimics human intelligence
to perform tasks such as learning, reasoning, problem-solving, and decision-making. It encompasses various subfields
including machine learning, natural language processing, computer vision, and robotics. AI systems analyze
vast amounts of data to identify patterns, make predictions, and improve their performance over time through iterative
processes. This technology has vast applications across industries, from healthcare, where it aids in diagnosing
diseases and personalizing treatment plans, to finance, where it enhances fraud detection and automates trading.
AI also powers virtual assistants like Siri and Alexa, self-driving cars, and advanced manufacturing processes.
As AI continues to evolve, it promises to revolutionize the way we live and work, offering unprecedented opportunities
for innovation and efficiency while also posing ethical and societal challenges that must be carefully managed."""

# Tokenizing the paragraph into sentences
# This breaks down the paragraph into individual sentences for further processing
sentences = nltk.sent_tokenize(paragraph)

# Creating an instance of the SnowballStemmer for the English language
# This instance will be used to stem words, reducing them to their base form
stemmer = SnowballStemmer('english')

# Getting the list of English stopwords
stop_words = set(stopwords.words('english'))

# Iterating over each sentence in the paragraph
for i in range(len(sentences)):
    # Tokenizing each sentence into words
    words = nltk.word_tokenize(sentences[i])

    # Removing stopwords and stemming the remaining words
    stemmed_words = [stemmer.stem(word) for word in words if word.lower() not in stop_words]

    # Joining the stemmed words back into a sentence
    sentences[i] = ' '.join(stemmed_words)

# Printing the processed sentences
for sentence in sentences:
    print(sentence)


## Lemmatization

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')  # Optional: for tokenizing sentences


In [None]:
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()


In [None]:
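# Pair each word with its WordNet part of speech: "v" verb, "n" noun, "r" adverb, "a" adjective
# The tags are assigned by hand here; passing pos="v" for every word would leave "better" unchanged
words = [("running", "v"), ("jumps", "v"), ("easily", "r"), ("fairly", "r"), ("better", "a")]

for word, pos in words:
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"Original: {word}, POS: {pos}, Lemma: {lemma}")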
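
In running text the part of speech usually comes from a tagger instead of being typed by hand. The helper below is a minimal sketch of one common way to bridge the two: it converts Penn Treebank tags (the kind `nltk.pos_tag` returns) into the single-letter codes that `WordNetLemmatizer.lemmatize` accepts. The function name `treebank_to_wordnet_pos` is an assumption made for this example, and falling back to noun for unknown tags is a design choice, not a requirement.

```python
def treebank_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("R"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # default to noun for everything else


for tag in ["JJR", "VBG", "RB", "NNS", "DT"]:
    print(f"{tag} -> {treebank_to_wordnet_pos(tag)}")
```

With the tagger resources downloaded, each `(word, tag)` pair from `nltk.pos_tag(nltk.word_tokenize(sentence))` could then be passed on as `lemmatizer.lemmatize(word, pos=treebank_to_wordnet_pos(tag))`.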
