# Stemming / Lemmatizing / Part of Speech

In [27]:
import nltk
import nlp_utilities as nlp

Read: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

These are operations that refer to finding a "root" word form.

The goal is to merge data items that are the same at some "root" meaning level, and reduce the number of features in your data set.  "Cats" and "Cat" might be treated as the same thing, from a topic or summarization perspective.  And with verbs, you want to reduce things with tense or aspect to the same root form:

* *wanted*
* *wants*
* *wanting*
* *want*

All four should be counted as the same item when you tally words by meaning.

Stemming removes affixes.  An affix can be considered a piece of a word, "affixed" to the word, that does things like make it plural (e.g., "s").  Porter is the default choice for stemming although other algorithms exist ("Snowball" is one).

In [4]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('wants')

'want'
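As a quick check of the *want* example above, the sketch below runs each form through the Porter stemmer and, for comparison, NLTK's Snowball stemmer for English. The variable names are illustrative only.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

forms = ["wanted", "wants", "wanting", "want"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs a language name

# Stem every form with each algorithm.
porter_stems = [porter.stem(word) for word in forms]
snowball_stems = [snowball.stem(word) for word in forms]

print(porter_stems)
print(snowball_stems)
```

Both stemmers should reduce all four forms to `want`, so the four tokens collapse into a single feature.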

Lemmatizing maps a word to its dictionary form (its lemma) using a vocabulary and grammar rules; NLTK's lemmatizer looks words up in WordNet. It is slower than stemming, which is why stemming is the more common choice in text analytics and NLP.

In [8]:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('was', pos='v')  # pos defaults to noun ('n'); without pos='v' you will not get 'be'

'be'

In [9]:
lemmatizer.lemmatize('cookbooks')

'cookbook'

In [6]:
lemmatizer.lemmatize('wicked', pos="v")

'wicked'

In [7]:
lemmatizer.lemmatize("were", pos="v")  # lemmatizing would allow us to collapse all forms of "be" into one token

'be'

In [12]:
lemmatizer.lemmatize("buses")

'bus'

In [10]:
# an apparently recommended compression recipe in Perkins Python 3 NLTK book? Not sure I agree.
stemmer.stem(lemmatizer.lemmatize('buses'))

'bu'

How would you use the lemmatizer or stemmer in the word counting code we already wrote? Hint: You will do another list comprehension that looks up the stem or lemma for every token.
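One possible shape for that list comprehension is sketched below. It assumes the tokens are already lowercased and uses `collections.Counter` as a stand-in for the word counting code, which is not shown in this section; the token list is made up for illustration.

```python
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical tokens; in practice these come from your tokenizer.
tokens = ["cats", "cat", "wanted", "wants", "wanting", "want"]

# Look up the stem for every token, then count stems instead of raw tokens.
stems = [stemmer.stem(token) for token in tokens]
counts = Counter(stems)

print(counts.most_common())
```

Swapping `stemmer.stem(token)` for `lemmatizer.lemmatize(token)` gives the lemma-based version of the same count.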

# Parts of Speech - Abbreviated "POS"

To do this part, make sure your nltk_data has the MaxEnt Treebank POS tagger. You can get it interactively with `nltk.download()` (on the Models tab).

In [13]:
import nltk
import string

### The output of POS tagging is tuples: pairs of the word and the part of speech label.

In [14]:
text = nltk.word_tokenize("And now I present your cat with something completely different.")
tagged = nltk.pos_tag(text)  # there are a few options for taggers, details in NLTK books
tagged

[('And', 'CC'),
 ('now', 'RB'),
 ('I', 'PRP'),
 ('present', 'VBP'),
 ('your', 'PRP$'),
 ('cat', 'NN'),
 ('with', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ'),
 ('.', '.')]
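The lemmatizer's `pos` argument takes WordNet's single-letter codes (`'n'`, `'v'`, `'a'`, `'r'`), not Penn Treebank tags like the ones above. A small helper can bridge the two. The function name `penn_to_wordnet` is hypothetical, and falling back to noun for unmatched tags is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # assumption: treat everything else as a noun


# A few pairs taken from the tagged sentence above.
tagged_sample = [("present", "VBP"), ("cat", "NN"), ("completely", "RB"), ("different", "JJ")]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged_sample]
print(wordnet_tags)
```

Each pair can then be passed along as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.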

In [23]:
nltk.pos_tag(["après"])

[('après', 'NN')]

In [12]:
# If you need to download the tagger model:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Find the open Python dialog, and download the MaxEnt Treebank POS tagger from Models.

The Penn Treebank part of speech tags are these:
![POS](assets/TreebankPOSTags.png)

In [25]:
nltk.pos_tag(["FBI"])

[('FBI', 'NNP')]

How would we collect just one part of speech in a text, like verbs?  By using our Python list skills...

In [28]:
tokens = nlp.tokenize_text("data/books/Austen_Emma.txt")

In [29]:
tokens[0:10]

['EMMA',
 'BY',
 'JANE',
 'AUSTEN',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse']

In [30]:
tagged = nltk.pos_tag(tokens)

In [31]:
tagged[0:10]

[('EMMA', 'NN'),
 ('BY', 'NNP'),
 ('JANE', 'NNP'),
 ('AUSTEN', 'NNP'),
 ('VOLUME', 'NNP'),
 ('I', 'PRP'),
 ('CHAPTER', 'VBP'),
 ('I', 'PRP'),
 ('Emma', 'NNP'),
 ('Woodhouse', 'NNP')]

In [32]:
nltk.pos_tag(["Emma"])

[('Emma', 'NN')]

These are tuples, so you have to filter the list for items whose second element (the tag) starts with "VB". You'll do this for your homework.
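A minimal sketch of that filter, using the short tagged sentence from earlier in this section rather than the full novel, might look like this. The variable names are illustrative.

```python
# (word, tag) pairs copied from the tagged sentence earlier in this section.
tagged_sentence = [
    ("And", "CC"), ("now", "RB"), ("I", "PRP"), ("present", "VBP"),
    ("your", "PRP$"), ("cat", "NN"), ("with", "IN"), ("something", "NN"),
    ("completely", "RB"), ("different", "JJ"), (".", "."),
]

# Keep the word whenever its tag starts with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged_sentence if tag.startswith("VB")]
print(verbs)  # ['present']
```

Unpacking each tuple as `word, tag` in the comprehension avoids indexing with `[0]` and `[1]`.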

In [33]:
# if you want to remove the tags and get the tokens back, you can use "untag":
nltk.untag(tagged)

['EMMA',
 'BY',
 'JANE',
 'AUSTEN',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a',
 'comfortable',
 'home',
 'and',
 'happy',
 'disposition',
 ',',
 'seemed',
 'to',
 'unite',
 'some',
 'of',
 'the',
 'best',
 'blessings',
 'of',
 'existence',
 ';',
 'and',
 'had',
 'lived',
 'nearly',
 'twenty-one',
 'years',
 'in',
 'the',
 'world',
 'with',
 'very',
 'little',
 'to',
 'distress',
 'or',
 'vex',
 'her',
 '.',
 'She',
 'was',
 'the',
 'youngest',
 'of',
 'the',
 'two',
 'daughters',
 'of',
 'a',
 'most',
 'affectionate',
 ',',
 'indulgent',
 'father',
 ';',
 'and',
 'had',
 ',',
 'in',
 'consequence',
 'of',
 'her',
 'sister',
 "'s",
 'marriage',
 ',',
 'been',
 'mistress',
 'of',
 'his',
 'house',
 'from',
 'a',
 'very',
 'early',
 'period',
 '.',
 'Her',
 'mother',
 'had',
 'died',
 'too',
 'long',
 'ago',
 'for',
 'her',
 'to',
 'have',
 'more',
 'than',
 'an',
 'indistinct',
 'remembrance',


Source for the Penn Treebank tag table: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Parts of speech are used in analysis that's "deeper" than bag-of-words approaches.  For instance, chunking (parsing for structure) may be used for entity identification and semantics.  See http://www.nltk.org/book/ch07.html for a little more info, and the two Perkins NLTK books.

Note also that "real linguists" parse a sentence into a syntactic structure, which is usually a tree form.

![tree](assets/sentence_tree.png)

([Source](http://media.openonline.com.cn/media_file/rm/dongshi2004/yyyyxgl/CHAPTER5/CH5S4E.htm))

For instance, try out the Stanford NLP parser visually at http://corenlp.run/.

