# Text Normalization

People write the same word in many different ways. For example, a user might express love as `[lovely, luv, loveee, ...]`. To a machine learning model that tries to decide whether a review is positive or negative, all of these carry essentially the same signal.

In this notebook we discuss two methods for reducing such variations: stemming and lemmatization.

# Stemming

Stemming reduces words by applying a set of pre-defined rules, such as removing `ing` from verbs.

Definition: Stemming is the process of removing affixes (mostly suffixes) from words in order to obtain their common base form. The result of stemming is a stem, which may not be a valid word but still represents the core meaning of the word.

Simplicity: Stemming is a heuristic process that applies simple rules to reduce words. It doesn't consider the context of the word.

Example: Stemming might reduce words like "jumping," "jumps," and "jumped" to the stem "jump."

Speed: Stemming is generally faster and computationally less intensive compared to lemmatization.

Use Cases: Stemming is commonly used in information retrieval tasks, such as search engines and document retrieval, where speed is crucial. It's also used in text classification and clustering.
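The rule-based idea described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the Porter algorithm: the rule list `SUFFIX_RULES`, the threshold `MIN_STEM_LENGTH`, and the function `toy_stem` are all assumptions made up for this example.

```python
# Toy suffix-stripping stemmer: an illustrative sketch, NOT the Porter algorithm.
# The rules and the minimum stem length below are assumptions for this example.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # flies -> fli
    ("ing", ""),     # jumping -> jump
    ("ed", ""),      # jumped -> jump
    ("s", ""),       # jumps -> jump
]
MIN_STEM_LENGTH = 3  # skip a rule if it would leave fewer than 3 characters


def toy_stem(word):
    """Apply the first matching suffix rule; no context or dictionary is used."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return word


for w in ["jumping", "jumps", "jumped", "caresses", "flies"]:
    print(w, "->", toy_stem(w))
```

Note that `flies` becomes `fli`, which is not a valid word. That is the typical trade-off of stemming: the rules are cheap to apply, but they do not check the result against a dictionary.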

In [1]:
# Import the PorterStemmer class from the nltk.stem.porter module
from nltk.stem.porter import PorterStemmer

# Create an instance of the PorterStemmer class
stemmer = PorterStemmer()

# Define a list of plural words to be stemmed
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']

# Use the stemmer to stem each word in the 'plurals' list and store the results in 'singles'
singles = [stemmer.stem(plural) for plural in plurals]

# Print the stemmed words, joined by commas
print(', '.join(singles))


caress, fli, die, mule, deni, die, agre, own, humbl, size, meet, state, siez, item, sensat, tradit, refer, colon, plot


# Snowball stemmer supports different languages

In [2]:
from nltk.stem.snowball import SnowballStemmer

print(", ".join(SnowballStemmer.languages))

arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish


In [3]:
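# Create a Snowball stemmer for Arabic
ar_stemmer = SnowballStemmer("arabic")
# Note: stem() expects a single word; here the whole phrase is passed as one string
ar_stemmer.stem("الجو سماؤه صافية")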

'جو سماوه صاف'
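Since `stem()` is meant to process one word at a time, a phrase is usually split into tokens first and each token is stemmed separately. A minimal sketch with the English Snowball stemmer follows. The helper `stem_tokens` is a hypothetical name introduced for this example, and splitting on whitespace is a simplifying assumption (a real pipeline would normally use a proper tokenizer).

```python
from nltk.stem.snowball import SnowballStemmer

# English Snowball stemmer; no extra corpus download is needed for this usage
en_stemmer = SnowballStemmer("english")


def stem_tokens(text, stemmer):
    """Split the text on whitespace and stem each token separately."""
    return [stemmer.stem(token) for token in text.split()]


print(stem_tokens("running dogs jumped", en_stemmer))
```

The same helper should work with any of the languages listed above by passing a different stemmer instance.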

# Lemmatization

The lemmatizer reduces a word by looking it up in `WordNet` to find its dictionary form (lemma), for example `rocks` -> `rock`.

Definition: Lemmatization is the process of reducing words to their base or dictionary form (lemma) while considering their context, linguistic rules, and part of speech. The result of lemmatization is a valid word that represents the base form of the word.

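Context Awareness: Lemmatization takes into account the context and meaning of words, ensuring that the resulting base form is a real word and not just a character-stripped root. In NLTK's `WordNetLemmatizer`, this information is supplied by the caller through the `pos` argument, which defaults to noun.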

Example: Lemmatization would reduce "better" and "best" to the lemma "good" because they are the comparative and superlative forms of "good."

Accuracy: Lemmatization is more accurate than stemming but can be slower and more computationally intensive because it involves dictionary lookups and linguistic analysis.

Use Cases: Lemmatization is often used in applications where accuracy and interpretation of text are important, such as text summarization, machine translation, sentiment analysis, and language generation.
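The dictionary-lookup idea can be sketched without any library. In the toy example below, a tiny hand-written table stands in for a real lexical database such as WordNet. The table `LEMMA_TABLE`, its entries, and the function `toy_lemmatize` are assumptions made up for illustration; they are not how NLTK implements lemmatization.

```python
# Toy lemmatizer: a hand-written lookup table keyed by (word, part of speech).
# The entries are assumptions for this example, standing in for a real lexicon.
LEMMA_TABLE = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
    ("best", "a"): "good",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print("rocks (n)   :", toy_lemmatize("rocks"))
print("better (a)  :", toy_lemmatize("better", pos="a"))
print("better (n)  :", toy_lemmatize("better"))
print("meeting (v) :", toy_lemmatize("meeting", pos="v"))
print("meeting (n) :", toy_lemmatize("meeting", pos="n"))
```

The sketch shows why the part of speech matters: the same surface form can map to different lemmas, and a word that is missing from the table for the given part of speech is returned unchanged.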

In [8]:
# Import the WordNetLemmatizer class from the nltk.stem module
from nltk.stem import WordNetLemmatizer

# Create an instance of the WordNetLemmatizer class named 'lemmatizer'
lemmatizer = WordNetLemmatizer()

# Perform lemmatization on different words and print the results

# Lemmatize the word "rocks" (default behavior treats it as a noun)
print("rocks :", lemmatizer.lemmatize("rocks"))

# Lemmatize the word "corpora" (default behavior treats it as a noun)
print("corpora :", lemmatizer.lemmatize("corpora"))

# Lemmatize the word "better" with the part of speech (pos) specified as "a" (adjective)
print("better :", lemmatizer.lemmatize("better", pos="a"))


rocks : rock
corpora : corpus
better : good


# Lemmatization vs Stemming

The key idea is that stemming sometimes mangles a word into a non-word, whereas lemmatization returns a valid word and keeps its meaning.
 
In summary, the key differences between stemming and lemmatization lie in their approaches and outputs. Stemming is a faster but less accurate technique that strips words to their roots, while lemmatization is a more accurate but computationally intensive process that reduces words to valid dictionary forms based on context and part of speech. The choice between stemming and lemmatization depends on the specific NLP task and the trade-offs between speed and accuracy.