# Text Normalization

People write the same word in many different ways. For example, a user might express love as `[lovely, luv, loveee, ...]`. To a machine learning model that tries to decide whether a review is positive or negative, all of these mean essentially the same thing.

In this notebook we discuss some methods for reducing these text variations.

# Stemming

Stemming reduces a word by applying a set of predefined rules, such as removing the `ing` suffix from verbs.

In [3]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']

singles = [stemmer.stem(plural) for plural in plurals]
print(', '.join(singles))

caress, fli, die, mule, deni, die, agre, own, humbl, size, meet, state, siez, item, sensat, tradit, refer, colon, plot
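
To make the idea of "predefined rules" concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions. The names `TOY_RULES`, `MIN_STEM_LENGTH`, and `toy_stem` are hypothetical and exist only for this illustration.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# Each rule is (suffix, replacement); the first rule that applies wins.
TOY_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # flies -> fli
    ("ing", ""),     # meeting -> meet
    ("ed", ""),      # owned -> own
    ("s", ""),       # mules -> mule
]

# Hypothetical guard: a rule applies only if at least this many
# characters remain before the suffix, so "sing" is not cut to "s".
MIN_STEM_LENGTH = 2


def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
            return word[:remaining] + replacement
    return word


words = ["caresses", "flies", "mules", "owned", "meeting", "sing"]
print(", ".join(toy_stem(w) for w in words))
```

The rules never check whether the result is a real word, which is why a stem such as `fli` can come out of a rule-based stemmer.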


# Snowball stemmer supports different languages

In [6]:
from nltk.stem.snowball import SnowballStemmer

print(", ".join(SnowballStemmer.languages))

arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish


In [10]:
ar_stemmer = SnowballStemmer("arabic")
ar_stemmer.stem("الجو سماؤه صافية")

'جو سماوه صاف'

# Lemmatization

A lemmatizer reduces a word by looking it up in `WordNet`, where it tries to find the word's dictionary form (its lemma). For example, `rocks` -> `rock`.

In [11]:
# WordNetLemmatizer needs the WordNet corpus (install once with nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))

rocks : rock
corpora : corpus
better : good
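
The lookup idea can be sketched without NLTK. The block below is a simplified, hypothetical model of a dictionary-based lemmatizer: `TOY_VOCAB`, `TOY_EXCEPTIONS`, `TOY_SUFFIX_RULES`, and `toy_lemmatize` are made-up names, and the tiny vocabulary stands in for a real lexical database such as WordNet. Unlike the real lemmatizer, this sketch ignores part of speech.

```python
# Toy dictionary-based lemmatizer. The vocabulary and exception table are
# tiny, hypothetical stand-ins for a real lexical database such as WordNet.
TOY_VOCAB = {"rock", "corpus", "fly", "good", "meet"}
TOY_EXCEPTIONS = {"corpora": "corpus", "better": "good"}  # irregular forms
TOY_SUFFIX_RULES = [("ies", "y"), ("s", ""), ("ing", "")]


def toy_lemmatize(word):
    """Return a lemma only if it is a known dictionary word, else the word itself."""
    word = word.lower()
    if word in TOY_EXCEPTIONS:       # irregular forms are looked up directly
        return TOY_EXCEPTIONS[word]
    if word in TOY_VOCAB:            # already a dictionary form
        return word
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in TOY_VOCAB:  # accept only real dictionary words
                return candidate
    return word                      # unknown words are left untouched


for w in ["rocks", "corpora", "better", "flies", "sing"]:
    print(w, ":", toy_lemmatize(w))
```

Because every candidate is checked against the vocabulary, the function either returns a known word or leaves the input unchanged. It never invents a fragment.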


# Lemmatization vs Stemming

The key difference is that stemming sometimes mangles a word into a non-word (for example, `flies` becomes `fli`), whereas lemmatization returns a valid dictionary word and so preserves the meaning.