## Stemming 
Stemming is a text preprocessing technique in natural language processing (NLP) that strips affixes (in practice mostly suffixes) from words to reduce related forms to a common base, the stem. Stemmers work by rule, so the stem they return is not always a dictionary word.

[dancing, dancer, danced, dances] -> dance
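To make the idea concrete, here is a minimal sketch of rule-based suffix stripping. The suffix list, the `naive_stem` name, and the `min_stem_len` parameter are assumptions made for illustration; this is not the algorithm of any NLTK stemmer. Note that the shared stem it produces is `danc`, not the dictionary word `dance`.

```python
# Hypothetical suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "er", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["dancing", "dancer", "danced", "dances"]]
print(stems)  # ['danc', 'danc', 'danc', 'danc']
```

All four forms collapse to one stem, which is what matters for tasks such as search or text classification, even though the stem itself is not a word.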


In [1]:
words = [
    "Running",
    "Jumps",
    "Easily",
    "Friendliness",
    "Happily",
    "Eating",
    "Studies",
    "Caring",
    "Organizations",
    "Cried",
    "Flying",
    "Houses",
    "Wondered",
    "Playing",
    "Quicker",
    "Wolves",
    "Bigger",
    "Amazement",
    "Completely",
    "Simplified",
    "Eaten",
    "History"
]

### Porter Stemmer
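Its main disadvantage is that the stems it produces are often not real words (for example, `easili` and `histori`), and some forms, such as `quicker` and `bigger`, are not reduced at all.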

In [19]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


True

In [9]:
from nltk.stem import PorterStemmer

In [11]:
porter_stemmer = PorterStemmer()

In [12]:
for word in words:
    print(word + ' ---> ' + porter_stemmer.stem(word))

Running ---> run
Jumps ---> jump
Easily ---> easili
Friendliness ---> friendli
Happily ---> happili
Eating ---> eat
Studies ---> studi
Caring ---> care
Organizations ---> organ
Cried ---> cri
Flying ---> fli
Houses ---> hous
Wondered ---> wonder
Playing ---> play
Quicker ---> quicker
Wolves ---> wolv
Bigger ---> bigger
Amazement ---> amaz
Completely ---> complet
Simplified ---> simplifi
Eaten ---> eaten
History ---> histori


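### Regexp Stemmer
`RegexpStemmer` removes any substring that matches a regular expression. The `min` argument is the minimum word length required for stemming to be applied.

In [2]: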
from nltk.stem import RegexpStemmer

In [3]:
reg_stemmer = RegexpStemmer('ing$|s$|ables$', min=4)

In [4]:
reg_stemmer.stem('eating')

'eat'
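The behaviour shown above can be approximated with the standard `re` module. This is a rough sketch under the assumption that the stemmer simply deletes every match of the pattern from words that are long enough; the `regexp_stem` name and its parameters are made up for illustration and are not part of NLTK.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|ables$", min_len=4):
    """Delete every match of `pattern` from words of at least `min_len` characters."""
    if len(word) < min_len:
        return word  # too short: leave untouched
    return re.sub(pattern, "", word)

print(regexp_stem("eating"))    # eat
print(regexp_stem("eatables"))  # eat
print(regexp_stem("bus"))       # bus  (shorter than min_len)
print(regexp_stem("class"))     # clas (over-stemmed: the final "s" is not a plural)
```

The last call shows the cost of this approach: a regular expression has no notion of grammar, so any word that happens to end in a listed suffix gets truncated.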

### Snowball Stemmer

In [5]:
from nltk.stem import SnowballStemmer

In [6]:
snowball_stemmer = SnowballStemmer('english')

In [7]:
for word in words:
    print(word + "----->", snowball_stemmer.stem(word))

Running-----> run
Jumps-----> jump
Easily-----> easili
Friendliness-----> friendli
Happily-----> happili
Eating-----> eat
Studies-----> studi
Caring-----> care
Organizations-----> organ
Cried-----> cri
Flying-----> fli
Houses-----> hous
Wondered-----> wonder
Playing-----> play
Quicker-----> quicker
Wolves-----> wolv
Bigger-----> bigger
Amazement-----> amaz
Completely-----> complet
Simplified-----> simplifi
Eaten-----> eaten
History-----> histori


In [13]:
porter_stemmer.stem("fairly"), porter_stemmer.stem("sportingly")

('fairli', 'sportingli')

In [15]:
snowball_stemmer.stem("fairly"), snowball_stemmer.stem("sportingly")  # Snowball strips the "-ly" ending that Porter only rewrites to "-li"

('fair', 'sport')

## WordNet Lemmatization
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike a stemmer, which may return a truncated stem that is not a word, a lemmatizer returns a valid dictionary word. The WordNet lemmatizer treats every word as a noun unless a part-of-speech tag is passed through the `pos` argument.
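The difference from stemming is easiest to see with a dictionary lookup keyed by both the word and its part of speech. The sketch below uses a tiny hand-written table; `LEMMAS` and `toy_lemmatize` are hypothetical names for illustration only, and the entries are not taken from WordNet's data.

```python
# Hypothetical mini-lexicon: (inflected form, part of speech) -> lemma
LEMMAS = {
    ("going", "v"): "go",
    ("went", "v"): "go",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("wolves", "n"): "wolf",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if the (word, pos) pair is known, else the word itself."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("going"))           # going (no noun entry, so unchanged)
print(toy_lemmatize("going", pos="v"))  # go
print(toy_lemmatize("went", pos="v"))   # go   (irregular form, no suffix rule would find it)
print(toy_lemmatize("Wolves"))          # wolf
```

Because the result depends on the part of speech, the same surface form can map to different lemmas, which is why passing the right `pos` value matters.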

In [16]:
from nltk.stem import WordNetLemmatizer

In [17]:
lemmatizer = WordNetLemmatizer()

In [20]:
lemmatizer.lemmatize('going')

'going'

In [21]:
lemmatizer.lemmatize('going', pos='v')

'go'

In [22]:
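# No `pos` is given, so every word is looked up as a noun. The inputs are also
# capitalized and WordNet lookups are case-sensitive; pass word.lower() to get lemmas.
for word in words:
    print(word + "--->" + lemmatizer.lemmatize(word))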

Running--->Running
Jumps--->Jumps
Easily--->Easily
Friendliness--->Friendliness
Happily--->Happily
Eating--->Eating
Studies--->Studies
Caring--->Caring
Organizations--->Organizations
Cried--->Cried
Flying--->Flying
Houses--->Houses
Wondered--->Wondered
Playing--->Playing
Quicker--->Quicker
Wolves--->Wolves
Bigger--->Bigger
Amazement--->Amazement
Completely--->Completely
Simplified--->Simplified
Eaten--->Eaten
History--->History


In [23]:
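# Even with pos='v' the output is unchanged because the inputs are capitalized;
# lowercase them first (word.lower()) so WordNet can find the verb forms.
for word in words:
    print(word + "--->" + lemmatizer.lemmatize(word, pos='v'))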

Running--->Running
Jumps--->Jumps
Easily--->Easily
Friendliness--->Friendliness
Happily--->Happily
Eating--->Eating
Studies--->Studies
Caring--->Caring
Organizations--->Organizations
Cried--->Cried
Flying--->Flying
Houses--->Houses
Wondered--->Wondered
Playing--->Playing
Quicker--->Quicker
Wolves--->Wolves
Bigger--->Bigger
Amazement--->Amazement
Completely--->Completely
Simplified--->Simplified
Eaten--->Eaten
History--->History
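The unchanged output above is consistent with a case-sensitive lexicon: `going` was reduced to `go` earlier, but no capitalized word is. A small wrapper that lowercases each word before the lookup is one way around this. In the sketch below, `lemmatize_lowercased` and `fake_lemmatize` are hypothetical names; `fake_lemmatize` is a stand-in with the same `(word, pos)` call shape as `lemmatizer.lemmatize`, so the example runs without the WordNet data.

```python
def lemmatize_lowercased(words, lemmatize, pos="n"):
    """Map each original word to the lemma of its lowercased form."""
    return {word: lemmatize(word.lower(), pos) for word in words}

# Stand-in lemmatizer (an assumption for this sketch): like a case-sensitive
# lexicon, it only knows lowercase entries.
def fake_lemmatize(word, pos="n"):
    known = {("running", "v"): "run", ("eaten", "v"): "eat"}
    return known.get((word, pos), word)

print(fake_lemmatize("Running", "v"))  # Running (capitalized key is not found)
result = lemmatize_lowercased(["Running", "Eaten"], fake_lemmatize, pos="v")
print(result)  # {'Running': 'run', 'Eaten': 'eat'}
```

With NLTK, the same wrapper would be called as `lemmatize_lowercased(words, lemmatizer.lemmatize, pos='v')`.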


In [24]:
lemmatizer.lemmatize("fairly"), lemmatizer.lemmatize("sportingly") 

('fairly', 'sportingly')