## Stemming:
Stemming reduces a word to a base or root form (its stem), usually by stripping suffixes. The stem does not have to be a dictionary word. It is a form of text normalization that lets related words such as "connect", "connected", and "connection" be treated as the same term.
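
As a minimal sketch of the idea above, the three related words from the definition can be passed through NLTK's `PorterStemmer` to check that they collapse to one stem. The variable names (`related_words`, `unique_stems`) are illustrative only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The related words used in the definition above
related_words = ["connect", "connected", "connection"]

# Map each word to its stem
stems = {word: stemmer.stem(word) for word in related_words}
print(stems)

# If the words share a root, the set of stems has a single element
unique_stems = set(stems.values())
print(unique_stems)
```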

## Types of Stemming:
<ul>
  <li><strong>Rule-Based (Porter Stemmer):</strong> Applies a fixed set of rules to strip common English suffixes.</li>
  <li><strong>Snowball Stemmer:</strong> An improved version of the Porter stemmer with support for multiple languages.</li>
  <li><strong>Lancaster Stemmer:</strong> More aggressive than Porter and often described as faster. Prone to over-stemming.</li>
  <li><strong>Regex-Based Stemming:</strong> Strips whatever matches a user-supplied regular expression. Very flexible but not standardized.</li>
  <li><strong>Lemmatization (not exactly stemming):</strong> Uses a vocabulary and part-of-speech information to return the dictionary form of a word. Generally more accurate than stemming.</li>
</ul>
</ul>
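
To make the differences in the list concrete, here is a rough side-by-side sketch that runs the four stemmers on the same words. The sample words and the `stemmers` dictionary are choices made for illustration, and the regex pattern is the one used later in this notebook. Lemmatization is left out because it needs the WordNet data.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, RegexpStemmer, SnowballStemmer

# One instance of each stemmer described in the list above
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
    "regex": RegexpStemmer("ing$|ed$|s$"),
}

sample_words = ["running", "studies", "connection"]

# Stem every sample word with every stemmer
comparison = {
    name: [stemmer.stem(word) for word in sample_words]
    for name, stemmer in stemmers.items()
}

for name, stems in comparison.items():
    print(f"{name:>10}: {stems}")
```

The regex stemmer only deletes the matched suffix, so it leaves "connection" untouched and turns "running" into "runn", while the rule-based stemmers also repair the doubled consonant.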

In [1]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\roger\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\roger\AppData\Roaming\nltk_data...


True

In [5]:
import nltk
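from nltk.tokenize import PunktSentenceTokenizer, WordPunctTokenizer
nltk.download('punkt')

# PunktSentenceTokenizer() with no arguments uses default parameters;
# it is not the pretrained English model
tokenizer1 = PunktSentenceTokenizer()
# WordPunctTokenizer splits text into runs of word characters and runs of punctuation
tokenizer2 = WordPunctTokenizer()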
text = "Natural Language Processing is fun. Let's explore tokenization with NLTK!"
sentences = tokenizer1.tokenize(text)

print(sentences)


['Natural Language Processing is fun.', "Let's explore tokenization with NLTK!"]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\roger\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
text2 = "The children are playing, studied, and flying in different directions."
tokens = tokenizer2.tokenize(text2)
print("Original Tokens:", tokens)

Original Tokens: ['The', 'children', 'are', 'playing', ',', 'studied', ',', 'and', 'flying', 'in', 'different', 'directions', '.']


## PorterStemmer

In [17]:
from nltk.stem import PorterStemmer
# Create the stemmer instance
stemming = PorterStemmer()

In [20]:
stemming.stem("Congratulation")

'congratul'
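
The single-word call above extends naturally to a tokenized sentence. The sketch below assumes the same sentence used for `text2` earlier and re-creates the tokenizer so the block runs on its own; `porter`, `sentence_tokens`, and `porter_stems` are illustrative names.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer

porter = PorterStemmer()

# Same sentence as text2 in the earlier cell
sentence = "The children are playing, studied, and flying in different directions."
sentence_tokens = WordPunctTokenizer().tokenize(sentence)

# Stem each token; punctuation tokens pass through unchanged
porter_stems = [porter.stem(token) for token in sentence_tokens]

for token, stem in zip(sentence_tokens, porter_stems):
    print(f"{token:>12} -> {stem}")
```

As with 'congratul' above, the stems come back lowercased and are not always dictionary words ("studied" becomes "studi").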

## Snowball Stemmer:

In [23]:
from nltk.stem import SnowballStemmer

# Specify the language (e.g., English)
snowstemming = SnowballStemmer("english")

# Example usage
word = "running"
print(snowstemming.stem(word))  # Output: run


run
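
Since multi-language support is the main reason to pick Snowball, a short sketch of that feature may help. `SnowballStemmer.languages` lists the supported languages; the German word here is just an example input.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the constructor
print(SnowballStemmer.languages)

# Build a stemmer for a language other than English
german = SnowballStemmer("german")
german_stem = german.stem("Autobahnen")
print(german_stem)
```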


## Lancaster Stemmer:

In [24]:
from nltk.stem import LancasterStemmer

# Create stemmer instance
lancaster = LancasterStemmer()

# Example words
words = ["running", "flies", "fairly", "maximum", "studies"]

# Apply stemming
stems = [lancaster.stem(word) for word in words]

print("Original Words:", words)
print("Lancaster Stems:", stems)


Original Words: ['running', 'flies', 'fairly', 'maximum', 'studies']
Lancaster Stems: ['run', 'fli', 'fair', 'maxim', 'study']
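
To see how much more aggressive Lancaster is, one option is to print its stems next to Porter's for the same words. This is only a comparison sketch; `rows` is an illustrative name and the word list is the one from the cell above.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Same example words as in the Lancaster cell above
words = ["running", "flies", "fairly", "maximum", "studies"]

# One row per word: (original, Porter stem, Lancaster stem)
rows = [(word, porter.stem(word), lancaster.stem(word)) for word in words]

print(f"{'word':>10} {'porter':>10} {'lancaster':>10}")
for word, porter_stem, lancaster_stem in rows:
    print(f"{word:>10} {porter_stem:>10} {lancaster_stem:>10}")
```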


## Regex-Based Stemming:

In [25]:
from nltk.stem import RegexpStemmer

# Define custom regex patterns for stemming
regex_stemmer = RegexpStemmer('ing$|ed$|s$')

# Example words to stem
words = ["running", "played", "flies", "studies", "happier"]

# Apply stemming
stems = [regex_stemmer.stem(word) for word in words]

print("Original Words:", words)
print("Regex Stems:", stems)


Original Words: ['running', 'played', 'flies', 'studies', 'happier']
Regex Stems: ['runn', 'play', 'flie', 'studie', 'happier']
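
Because the regex stemmer deletes every match blindly, short words can be damaged. `RegexpStemmer` accepts a `min` argument: words shorter than that length are returned unchanged. The sketch below contrasts the two settings; the threshold of 5 and the word list are arbitrary choices for illustration.

```python
from nltk.stem import RegexpStemmer

pattern = "ing$|ed$|s$"

# No length guard: every match is stripped
unguarded = RegexpStemmer(pattern)
# Words shorter than 5 characters are left alone
guarded = RegexpStemmer(pattern, min=5)

short_words = ["is", "bed", "sing", "running"]

unguarded_stems = [unguarded.stem(word) for word in short_words]
guarded_stems = [guarded.stem(word) for word in short_words]

print("Without min:", unguarded_stems)
print("With min=5: ", guarded_stems)
```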


## Lemmatization (not exactly stemming):

In [27]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Reuse `tokens` from the earlier WordPunctTokenizer cell (the tokenized text2)

# pos='v' treats every token as a verb, so nouns such as 'children' are left unchanged
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

print("Original Tokens:", tokens)
print("Lemmatized Words:", lemmatized_words)


Original Tokens: ['The', 'children', 'are', 'playing', ',', 'studied', ',', 'and', 'flying', 'in', 'different', 'directions', '.']
Lemmatized Words: ['The', 'children', 'be', 'play', ',', 'study', ',', 'and', 'fly', 'in', 'different', 'directions', '.']
