# Understanding word normalization

Most of the time, we don't want every individual word fragment we have ever encountered to end up in our vocabulary.

There are several reasons for this. One is the need to handle surface variants correctly, for example the phrase `U.N.` (with characters separated by periods) versus `UN` (without any periods).

We can also reduce words to their dictionary root form. For instance, `am`, `are`, and `is` can be identified by their root form, `be`.

We can also strip inflections so that related words share one form. The words `car`, `cars`, and `car's` can all be identified as `car`.

Also, common words that occur very frequently and do not convey much meaning, such as the articles `a`, `an`, and `the`, can be removed. 

However, all of these steps depend heavily on the use case. Wh-words, such as `when`, `why`, `where`, and `who`, do not carry much information in most contexts and are removed as part of a technique called stop word removal, which we will see a little later in the Stopwords section.

In situations such as question classification and question answering, however, these words become very important and should not be removed.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import nltk

In [3]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating',
           'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']

# Porter Stemmer

In [4]:
from nltk.stem.porter import PorterStemmer 

stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]

print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener


# Snowball Stemmer

In [5]:
from nltk.stem.snowball import SnowballStemmer

print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [6]:
stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]

print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous


# Over-stemming and under-stemming

Potential problems with stemming arise in the form of over-stemming and under-stemming. 

Over-stemming occurs when words that should have been reduced to different roots are stemmed to the same root.

Under-stemming is the opposite problem: words that should have been reduced to the same root are stemmed to different roots.
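
Both problems can be observed with the Porter stemmer used earlier. The sketch below is illustrative only: the two word groups are examples chosen here rather than taken from the text, and `stems_for` is a hypothetical helper.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stems_for(words):
    """Return the set of distinct stems produced for a group of words."""
    return {stemmer.stem(word) for word in words}

# Assumed example for over-stemming: three words with different meanings.
unrelated = ['universal', 'university', 'universe']
# Assumed example for under-stemming: three forms of the same word.
related = ['alumnus', 'alumni', 'alumnae']

print(stems_for(unrelated))  # a single shared stem signals over-stemming
print(stems_for(related))    # several distinct stems signal under-stemming
```

Counting the distinct stems per group is a simple way to spot either problem on a word list of your own.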

# Wordnet Lemmatizer

Unlike stemming, wherein a few characters are removed from words using crude methods, lemmatization is a process wherein the context is used to convert a word to its meaningful base form. 

It helps in grouping together words that have a common base form and so can be identified as a single item. The base form is referred to as the lemma of the word and is also sometimes known as the dictionary form.

Lemmatization algorithms try to identify the lemma form of a word by taking into account the neighborhood context of the word, part-of-speech (POS) tags, the meaning of a word, and so on. The neighborhood can span across words in the vicinity, sentences, or even documents.

In [7]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\anamini\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
# WordNet is a freely and publicly available lexical database of English. In WordNet,
# nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets),
# each expressing a distinct concept. These synsets are interlinked through lexical and conceptual-semantic relations.

lemmatizer = WordNetLemmatizer()

s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()
print("Tokens: ", token_list)

print("Original: ", s)
lemmatized_output = ' '.join([lemmatizer.lemmatize(token) for token in token_list])
print("Lemmatized: ", lemmatized_output)

Tokens:  ['We', 'are', 'putting', 'in', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
Original:  We are putting in efforts to enhance our understanding of Lemmatization
Lemmatized:  We are putting in effort to enhance our understanding of Lemmatization


## Part of Speech (POS) Tagging

In [9]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\anamini\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

## POS tag Mapping

In [10]:
from nltk.corpus import wordnet

# This mapping helper is widely used among NLP practitioners.

def get_part_of_speech_tags(token):
    """Map a token's POS tag to the first character that lemmatize() accepts.
    We focus on verbs, nouns, adjectives, and adverbs here."""

    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Note: the token is tagged on its own, without its sentence context.
    tag = nltk.pos_tag([token])[0][1][0].upper()
    # Fall back to noun for any tag outside the four categories above.
    return tag_dict.get(tag, wordnet.NOUN)

## Wordnet Lemmatizer with POS Tag Information

In [11]:
lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]
print(s)
print(' '.join(lemmatized_output_with_POS_information))

We are putting in efforts to enhance our understanding of Lemmatization
We be put in effort to enhance our understand of Lemmatization
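
Note that `get_part_of_speech_tags` tags each token in isolation, so the tagger never sees the neighbouring words. One possible alternative, sketched below, is to tag the whole sentence once (as in the POS tagging cell) and map each Penn Treebank tag to a WordNet category. `penn_to_wordnet` is a hypothetical helper. The single-letter values are those of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`, written as literals so that the snippet runs without downloading any corpus.

```python
# Literal values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV.
ADJ, NOUN, VERB, ADV = 'a', 'n', 'v', 'r'

TAG_PREFIX_TO_WORDNET = {'J': ADJ, 'N': NOUN, 'V': VERB, 'R': ADV}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(penn_tag[:1].upper(), NOUN)

# A few (token, tag) pairs copied from the sentence-level tagging output above.
pos_tags = [('We', 'PRP'), ('are', 'VBP'), ('putting', 'VBG'),
            ('efforts', 'NNS'), ('understanding', 'NN')]

wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in pos_tags]
print(wordnet_tags)
```

Each pair can then be passed to `lemmatizer.lemmatize(token, pos)`. With sentence-level tags, `understanding` is treated as a noun, so it would likely be kept as it is rather than reduced to `understand`.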


## Lemmatization vs Stemming

In [12]:
stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(s)
print(' '.join(stemmed_sentence))

We are putting in efforts to enhance our understanding of Lemmatization
we are put in effort to enhanc our understand of lemmat


# spaCy Lemmatizer

In [13]:
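import spacy

# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

doc = nlp("We are putting in efforts to enhance our understanding of Lemmatization")
" ".join([token.lemma_ for token in doc])

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

This error means that the `en_core_web_sm` model is not installed in the current environment. Installing it with `python -m spacy download en_core_web_sm` and re-running the cell resolves it.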

# Stopwords

In [14]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
", ".join(stop)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anamini\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


"on, to, so, you, those, by, couldn't, him, themselves, yours, was, wouldn't, t, these, don, it, which, her, didn't, were, the, wouldn, does, won, of, more, before, some, other, y, is, i, who, ll, theirs, d, up, doing, for, a, few, you'll, most, where, shan, have, s, ve, it's, against, an, them, each, until, over, we, you're, there, as, and, can, after, ma, just, that'll, when, wasn, they, now, in, needn, couldn, why, what, shouldn, weren't, ours, won't, further, doesn, from, their, own, or, my, this, had, been, should, because, with, are, no, doesn't, that, did, mustn't, above, hadn, your, be, off, both, aren't, once, if, he, mightn, shouldn't, again, between, wasn't, she, aren, not, during, yourselves, hers, into, our, itself, m, weren, his, mustn, you'd, haven't, how, ain, haven, myself, down, you've, its, she's, shan't, isn, through, didn, am, while, than, only, at, has, out, should've, me, then, herself, don't, will, himself, here, very, all, re, do, having, such, about, o, hasn't

In [15]:
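wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']

stop = set(stopwords.words('english'))

sentence = "how are we putting in efforts to enhance our understanding of Lemmatization"

# Keep the wh-words by taking them out of the stop word set.
# discard() does not raise an error if a word is absent from the set.
for word in wh_words:
    stop.discard(word)

# Compare in lowercase, because the stop word list is all lowercase.
sentence_after_stopword_removal = [token for token in sentence.split() if token.lower() not in stop]
" ".join(sentence_after_stopword_removal)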

'how putting efforts enhance understanding Lemmatization'

# Case Folding

Another strategy that helps with normalization is called case folding. As part of case folding, all the letters in the text corpus are converted to lowercase. `The` and `the` are treated as the same token under case folding, whereas they would be treated as different tokens without it. This technique helps information retrieval systems, such as search engines.

In [16]:
s = "We are putting in efforts to enhance our understanding of Lemmatization"
s = s.lower()
s

'we are putting in efforts to enhance our understanding of lemmatization'
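
To make the effect on the vocabulary concrete, the sketch below counts tokens before and after folding. The sample sentence is an assumption made for illustration. The sketch uses `str.casefold()`, which behaves like `lower()` for plain English text.

```python
from collections import Counter

text = "The cat chased the dog and the dog chased The Cat"  # assumed sample text

# Count every distinct surface form, then count again after folding case.
raw_counts = Counter(text.split())
folded_counts = Counter(token.casefold() for token in text.split())

print(len(raw_counts), len(folded_counts))  # 7 5
print(folded_counts['the'])                 # 4
```

Folding merges `The`/`the` and `Cat`/`cat`, so the vocabulary shrinks from seven entries to five.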

# N-grams

In [17]:
from nltk.util import ngrams

s = "Natural Language Processing is the way to go"
tokens = s.split()
bigrams = list(ngrams(tokens, 2))
[" ".join(gram) for gram in bigrams]

['Natural Language',
 'Language Processing',
 'Processing is',
 'is the',
 'the way',
 'way to',
 'to go']

In [18]:
s = "Natural Language Processing is the way to go"
tokens = s.split()
trigrams = list(ngrams(tokens, 3))
[" ".join(gram) for gram in trigrams]

['Natural Language Processing',
 'Language Processing is',
 'Processing is the',
 'is the way',
 'the way to',
 'way to go']
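
An n-gram is a run of n consecutive tokens. As a rough sketch of what is being computed, the same sliding window can be written in plain Python with `zip`. `simple_ngrams` is a hypothetical helper and is not meant to reflect how NLTK implements `ngrams`.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, which yields len(tokens) - n + 1 windows.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Natural Language Processing is the way to go".split()
print(simple_ngrams(tokens, 2)[:2])   # first two bigrams
print(len(simple_ngrams(tokens, 3)))  # number of trigrams
```

For eight tokens this gives seven bigrams and six trigrams, matching the lists above.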

# Building a basic vocabulary

In [19]:
s = "Natural Language Processing is the way to go"
tokens = set(s.split())
vocabulary = sorted(tokens)
vocabulary

['Language', 'Natural', 'Processing', 'go', 'is', 'the', 'to', 'way']
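
Downstream steps usually need to refer to tokens by number, so a sorted vocabulary like this is often turned into a token-to-index mapping. A minimal sketch follows, in which `token_to_index` and `encode` are names chosen here and mapping unknown tokens to -1 is an assumption:

```python
s = "Natural Language Processing is the way to go"
vocabulary = sorted(set(s.split()))

# Assign each token its position in the sorted vocabulary.
token_to_index = {token: index for index, token in enumerate(vocabulary)}

def encode(sentence):
    """Replace each token by its vocabulary index (-1 for unknown tokens)."""
    return [token_to_index.get(token, -1) for token in sentence.split()]

print(encode("Language Processing is fun"))  # [0, 2, 4, -1]
```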

# Removing HTML Tags

In [20]:
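from bs4 import BeautifulSoup

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"

# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')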
text = soup.get_text()
print(text)

My First HeadingMy first paragraph.
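
The output above runs the heading and the paragraph together because no separator is placed between text nodes. The sketch below shows one way to keep them apart. It deliberately swaps in the standard library's `html.parser` instead of BeautifulSoup so that it has no third-party dependency; `TextExtractor` and `strip_tags` are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, one chunk per text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text nodes only; tags and the doctype are skipped.
        # Script and style contents would also arrive here and are not filtered.
        text = data.strip()
        if text:
            self.chunks.append(text)

def strip_tags(markup):
    """Return the text content of markup with chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)

html = "<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
print(strip_tags(html))  # My First Heading My first paragraph.
```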
