- With basic word tokenization, each variation of a word will have its own vector component: "walk", "walking", "walks", "walked", ...

- Note that "walk" is no closer to "walking" than it is to "cartwheel" (unless similarity can be learned from the data).

- It also inflates the vocabulary: every inflected form adds its own dimension, so the resulting vectors are high-dimensional.

- Practical issue: suppose we are building a search engine and a user searches for "running". Documents containing "run" or "ran" are clearly relevant too, but with basic word tokenization they are distinct tokens and never match the query "running".

- Solution: reduce each word to its root form $\, \Rightarrow \,$ Stemming & lemmatization
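The point that "walk" is no closer to "walking" than to "cartwheel" can be checked with a minimal sketch. It assumes a one-hot encoding over a hypothetical five-word vocabulary (the names `vocab`, `one_hot` and `cosine` are chosen for illustration only):

```python
import math

# Hypothetical toy vocabulary: every surface form gets its own component.
vocab = ["walk", "walking", "walks", "walked", "cartwheel"]

def one_hot(word):
    """Return a vector with 1.0 in the word's own slot and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_walking = cosine(one_hot("walk"), one_hot("walking"))
sim_cartwheel = cosine(one_hot("walk"), one_hot("cartwheel"))
print(sim_walking, sim_cartwheel)  # 0.0 0.0
```

Both similarities are zero: under this encoding, all distinct tokens are orthogonal, regardless of how related the words are.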

<h3>Stemming</h3>

- Stemming is a very crude method: it simply chops off the end of the word.
- The result is not necessarily a real word.

In [1]:
from nltk.stem import PorterStemmer

In [2]:
porter = PorterStemmer()

In [4]:
porter.stem('walking')

'walk'

In [5]:
porter.stem('walk')

'walk'

In [9]:
porter.stem('Running')

'run'

In [10]:
# stem() expects a single token: a whole sentence is treated as one long "word"
porter.stem('hello I am Marlon')

'hello i am marlon'

In [11]:
porter.stem('replacement')

'replac'

- Not a real word!
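To connect stemming back to the search-engine problem, here is a rough sketch of an inverted index built twice: once on lowercased tokens and once on stems. The `toy_stem` function is a deliberately tiny stand-in written for this example, not the Porter algorithm, and the documents are made up.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "s")  # toy rule set, far smaller than Porter's

def toy_stem(word):
    """Chop off one known suffix; an illustrative stand-in, not Porter."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"), but keep "ll", "ss", "zz".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

# Hypothetical mini-corpus.
docs = {
    1: "He ran home",
    2: "I run daily",
    3: "Running shoes on sale",
}

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

exact_index = build_index(docs, str.lower)
stem_index = build_index(docs, toy_stem)

exact_hits = sorted(exact_index.get("running", set()))
stem_hits = sorted(stem_index.get(toy_stem("running"), set()))
print(exact_hits)  # [3]
print(stem_hits)   # [2, 3]
```

Stemming both the index and the query lets "running" match "run". Document 1 is still missed, because chopping suffixes cannot relate an irregular form such as "ran" to "run".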

<h3>Lemmatization</h3>

- Lemmatization is more sophisticated: it uses actual rules of the language.
- It returns the lemma, i.e. the dictionary form of the word.
- Can be thought of as a *lookup table* / *table of rules*.
- Note that a lemmatizer cannot be tuned to become better or worse: its output simply comes from a database of words.

In [6]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

In [7]:
# Download the word database (the "lookup table")
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/herrakaava/nltk_data...


True

In [8]:
lemmatizer = WordNetLemmatizer()

In [20]:
print(lemmatizer.lemmatize('mice'))
print(porter.stem('mice'))

mouse
mice


In [23]:
print(lemmatizer.lemmatize('going', pos=wordnet.VERB))
print(porter.stem('going'))

go
go


- `pos` is the part-of-speech tag of the word. It defaults to `"n"` (noun).
- Valid options are `"n"` for nouns, `"v"` for verbs, `"a"` for adjectives, `"r"` for adverbs and `"s"` for satellite adjectives.
- This means that, to use the lemmatizer properly, we should first do POS tagging.

In [56]:
lemmatizer.lemmatize('better')

'better'

In [61]:
lemmatizer.lemmatize('better', pos='a')

'good'
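Why the result for 'better' depends on `pos` follows from the lookup-table view of lemmatization. The sketch below is a toy model of that idea: per-POS exception tables plus a few suffix rules whose result is accepted only if it is a known lemma. All tables are invented for illustration and are not taken from WordNet; the real lemmatizer's procedure may differ in its details.

```python
# Toy data, invented for illustration; not taken from WordNet.
EXCEPTIONS = {
    "n": {"mice": "mouse"},
    "v": {"ran": "run", "is": "be"},
    "a": {"better": "good"},
}
KNOWN_LEMMAS = {
    "n": {"mouse", "walk", "better"},
    "v": {"run", "be", "walk", "go"},
    "a": {"good"},
}
SUFFIX_RULES = {
    "n": [("s", "")],
    "v": [("ing", ""), ("ed", ""), ("s", "")],
    "a": [],
}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given POS; return it unchanged if nothing matches."""
    # 1) Irregular forms come straight from the exception table.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2) Otherwise try each suffix rule and accept only known lemmas.
    for suffix, replacement in SUFFIX_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in KNOWN_LEMMAS[pos]:
                return candidate
    return word

print(toy_lemmatize("better"))           # better  (looked up as a noun)
print(toy_lemmatize("better", pos="a"))  # good    (looked up as an adjective)
print(toy_lemmatize("going", pos="v"))   # go
print(toy_lemmatize("going"))            # going   (no noun rule applies)
```

Because each part of speech has its own tables, the same string can map to different lemmas, and a wrong tag silently returns the word unchanged.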

In [62]:
s = "Swimming is way more fun than running"

In [65]:
for token in s.split():
    print(porter.stem(token), end=' ')

swim is way more fun than run 

In [67]:
for token in s.split():
    print(lemmatizer.lemmatize(token), end=' ')

Swimming is way more fun than running 

In [66]:
for token in s.split():
    print(lemmatizer.lemmatize(token, pos=wordnet.VERB), end=' ')

Swimming be way more fun than run 

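Entering the correct tag by hand for every word (e.g., 'a' for adjectives, 'n' for nouns, etc.) does not scale, so this needs to be handled automatically. The helper below maps the tags produced by a POS tagger to the tags the lemmatizer expects.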

In [68]:
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag (e.g. 'VBG') to a WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun, which is also the lemmatizer's default
        return wordnet.NOUN
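The same mapping can also be written as a small table, which makes the fallback explicit and runs without importing NLTK. This is only a sketch of an alternative: `treebank_to_wordnet` and `TREEBANK_PREFIX_TO_WORDNET` are hypothetical names, and the single-letter values are the strings behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter.
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag, default="n"):
    """Return the WordNet POS letter for a Treebank tag, or `default` if unmapped."""
    return TREEBANK_PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["JJ", "VBG", "NNS", "RB", "DT"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as `DT` (determiner), fall through to the noun default, so those words are passed to the lemmatizer as nouns.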

In [74]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/herrakaava/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [75]:
s2 = "How is the stock market doing today".split()

In [76]:
words_and_tags = nltk.pos_tag(s2)

In [77]:
words_and_tags

[('How', 'WRB'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('stock', 'NN'),
 ('market', 'NN'),
 ('doing', 'VBG'),
 ('today', 'NN')]

In [78]:
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=' ')

How be the stock market do today 