# Stemming

#### Stemming is the process of reducing a word to a root or base form by stripping affixes, usually suffixes. It is a common technique in Natural Language Processing (NLP) and Information Retrieval for treating related words, such as "playing" and "played", as the same term. The resulting stem does not have to be a dictionary word.

In [None]:
words = [
    "playing", "played", "player", "happiness", "unhappy", "studies", "studying",
    "quickly", "running", "runner", "better", "flying", "flies",
    "denied", "denying", "organization", "organizational", "caring", "cared", "hoping",
]

## PorterStemmer

In [6]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in words:
    print(f"{word} ---> {stemmer.stem(word)}")

playing ---> play
played ---> play
player ---> player
happiness ---> happi
unhappy ---> unhappi
studies ---> studi
studying ---> studi
quickly ---> quickli
running ---> run
runner ---> runner
better ---> better
flying ---> fli
flies ---> fli
denied ---> deni
denying ---> deni
organization ---> organ
organizational ---> organiz
caring ---> care
cared ---> care
hoping ---> hope


In [None]:
# One disadvantage of stemming is that it can produce stems that are not real words.
# For example, "playing" becomes "play", which is fine, but "happiness" becomes "happi",
# which is not a real word. This can cause confusion in applications that display
# or interpret the stems directly. Even so, stemming is widely used in information
# retrieval and natural language processing tasks, where the exact form of a word
# matters less than its root meaning.
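
To make the retrieval argument concrete, here is a minimal sketch that groups words by their Porter stem, so that different surface forms land in the same bucket. The helper name `group_by_stem` and the short `sample` list are made up for illustration; only `PorterStemmer` comes from NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer):
    """Map each stem to the words that reduce to it (hypothetical helper)."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


# A few words from the list above; the sample itself is an assumption.
sample = ["playing", "played", "studies", "studying", "flying", "flies"]
groups = group_by_stem(sample, PorterStemmer())
for stem, members in groups.items():
    print(f"{stem}: {members}")
```

The keys ("play", "studi", "fli") are not all real words, but that does not matter here: they only serve as shared lookup keys for related forms.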

## RegexpStemmer

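#### The RegexpStemmer (Regular Expression Stemmer) in NLTK is a rule-based stemmer that removes any part of a word matched by a regular expression you supply. With patterns anchored by `$`, as below, it strips suffixes. The `min` argument sets the minimum word length to stem, so shorter words are returned unchanged. Because you define the rules yourself, it is lightweight and flexible for simple stemming tasks, but it applies them blindly: "caring" becomes "car" and "running" becomes "runn".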

In [10]:
from nltk.stem import RegexpStemmer
regexpstemmer = RegexpStemmer(regexp='ing$|ed$|es$|s$|ness$|ly$|al$',min=4)
for word in words:
    print(f"{word} ---> {regexpstemmer.stem(word)}")

playing ---> play
played ---> play
player ---> player
happiness ---> happi
unhappy ---> unhappy
studies ---> studi
studying ---> study
quickly ---> quick
running ---> runn
runner ---> runner
better ---> better
flying ---> fly
flies ---> fli
denied ---> deni
denying ---> deny
organization ---> organization
organizational ---> organization
caring ---> car
cared ---> car
hoping ---> hop
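
The same behaviour can be approximated with the standard `re` module. This is a rough sketch rather than NLTK's actual source: the function name `regexp_stem` and its `min_length` parameter are assumptions made for illustration, standing in for `RegexpStemmer.stem` and `min`.

```python
import re

# Same pattern as above; each alternative is anchored to the end of the word.
SUFFIXES = re.compile(r"ing$|ed$|es$|s$|ness$|ly$|al$")


def regexp_stem(word, min_length=4):
    """Remove a suffix matching SUFFIXES; words shorter than min_length pass through."""
    if len(word) < min_length:
        return word
    return SUFFIXES.sub("", word)


for word in ["playing", "caring", "red", "bus", "unhappy"]:
    print(f"{word} ---> {regexp_stem(word)}")
```

The length guard is what protects short words: with `min_length=0`, `regexp_stem("red")` would strip the `ed` and return `"r"`.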


## SnowballStemmer

#### NLTK's SnowballStemmer provides the Snowball family of stemmers. Its English stemmer, also known as Porter2, is a revised version of the original Porter stemmer. It is more consistent, supports multiple languages, and is generally less aggressive than the Lancaster stemmer, which makes it a reasonable default for many NLP tasks. Like Porter, it can still return stems that are not real words, such as "happi".

In [12]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer(language="english")
for word in words:
    print(f"{word} ---> {snowball_stemmer.stem(word)}")

playing ---> play
played ---> play
player ---> player
happiness ---> happi
unhappy ---> unhappi
studies ---> studi
studying ---> studi
quickly ---> quick
running ---> run
runner ---> runner
better ---> better
flying ---> fli
flies ---> fli
denied ---> deni
denying ---> deni
organization ---> organ
organizational ---> organiz
caring ---> care
cared ---> care
hoping ---> hope
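
Comparing this output with the PorterStemmer output earlier, the two stemmers agree on every word in the list except "quickly". A small sketch for spotting such differences automatically is shown here; the `sample` list and the `differences` name are chosen for illustration only.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# A handful of words from the list above; the selection is an assumption.
sample = ["playing", "happiness", "quickly", "running", "organization"]

# Keep only the words on which the two stemmers disagree.
differences = {
    word: (porter.stem(word), snowball.stem(word))
    for word in sample
    if porter.stem(word) != snowball.stem(word)
}
for word, (porter_stem, snowball_stem) in differences.items():
    print(f"{word}: porter={porter_stem}, snowball={snowball_stem}")
```

Running the same comparison on a larger vocabulary is a quick way to judge whether switching stemmers would change results for a given task.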


##### The main disadvantage of most stemming algorithms is that they can be too aggressive: they produce stems that are not real words or carry no meaning, and sometimes they change a word's meaning altogether (RegexpStemmer turns "caring" into "car"). Lemmatization addresses this problem.