## Categorizing and Tagging Words

Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These are very useful categories for many language processing tasks. Our goal in this chapter is to answer the following questions:

1. What are lexical categories and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. 


### Using a POS tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word. This is an important step toward extracting meaning from text: we often need to find out the what, where, who, and how in a document, or determine its sentiment. The NLTK (Natural Language Toolkit) library is one of a number of systems that we can use to analyze text. Here are some examples:


In [None]:
# load the toolkit
import nltk

Use the toolkit to tokenize some text (split it into words), and then label the words with their parts of speech.

In [None]:
# tokenize the text
text = nltk.word_tokenize("And now for something completely different")
# Show the parts of speech for each word
nltk.pos_tag(text)

OK, but what do 'CC', 'RB', and 'IN' mean? Here we see that "And" is CC, a coordinating conjunction; "now" and "completely" are RB, or adverbs; "for" is IN, a preposition; "something" is NN, a noun; and "different" is JJ, an adjective.
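One simple way to keep these abbreviations readable is a plain dictionary from tag to description. The sketch below hard-codes the tagged output described above instead of calling the tagger, and the name `TAG_MEANINGS` is purely illustrative; it is not part of NLTK.

```python
# Hand-copied tagger output for the example sentence (assumed to match the
# tags discussed above; a different NLTK version could tag differently).
tagged_example = [
    ("And", "CC"), ("now", "RB"), ("for", "IN"),
    ("something", "NN"), ("completely", "RB"), ("different", "JJ"),
]

# Illustrative lookup table, not an NLTK object.
TAG_MEANINGS = {
    "CC": "coordinating conjunction",
    "RB": "adverb",
    "IN": "preposition",
    "NN": "noun",
    "JJ": "adjective",
}

# Pair each word with a readable description of its tag.
described = [(word, TAG_MEANINGS.get(tag, "unknown")) for word, tag in tagged_example]
for word, meaning in described:
    print(f"{word:<12}{meaning}")
```

A dictionary like this only covers the tags you put into it; the lookup falls back to "unknown" for anything else.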

NLTK provides documentation for each tag, which can be queried using the tag, e.g. `nltk.help.upenn_tagset('RB')`, or a regular expression, e.g. `nltk.help.upenn_tagset('NN.*')`.

In [None]:
# nltk.download()



If the next command doesn't work, run `nltk.download()`
and download the 'book' collection by typing `d` and then `book`.


In [None]:
# Ask NLTK what the JJ tag means, with some examples

nltk.help.upenn_tagset('JJ')

### Tag meanings for the UPenn (default) tagset
Display all of the possible POS tags and examples.

In [None]:
nltk.help.upenn_tagset()

### Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a **tuple** consisting of the token and the tag. We can create one of these tuples from the standard string representation of a tagged token, using the function `str2tuple()`.
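As a minimal sketch of that conversion, the cell below turns `word/TAG` strings into tuples with `nltk.tag.str2tuple`. The example strings are made up for illustration and are not taken from a corpus used in this notebook.

```python
from nltk.tag import str2tuple

# "word/TAG" is the conventional string form of a tagged token.
tagged_token = str2tuple("fly/NN")
print(tagged_token)

# A whole tagged sentence can be converted by splitting on whitespace first.
sent = "The/AT grand/JJ jury/NN commented/VBD"
tagged_sent = [str2tuple(t) for t in sent.split()]
print(tagged_sent)
```

This needs no downloaded NLTK data, since it only manipulates strings.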

Note how NLTK disambiguates the two occurrences of the token "refuse" in the sentence below.

In [None]:
tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged = nltk.pos_tag(tokens)
tagged

We can index into the `tagged` list to retrieve the first tagged token, and then index into that tuple to get the word and its tag.

In [None]:
tagged_token = tagged[0]
print(tagged_token)
print(tagged_token[0])
print(tagged_token[1])

Now let's iterate through the tagged tuples and separate the tokens from the POS tags.


In [None]:
# print the original text, tokenized
print("Text = ", tokens)
# Now the same from the tagged tuples (note the list comprehension)
tokens = [word for (word, tag) in tagged]
print("Tokens = ", tokens)
# and then print the POS tags
tags = [tag for (word, tag) in tagged]
print("POS Tags = ", tags)
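Once tokens and tags are separated, it is a short step to see which words the tagger treated in more than one way. The sketch below groups tags by word using a `defaultdict`. To stay self-contained it uses a hand-written `sample_tagged` list in place of the real tagger output, so the exact tags are an assumption and may differ from what your NLTK version returns.

```python
from collections import defaultdict

# Hand-written sample standing in for the tagger's output on the sentence
# above; the exact tags are an assumption and may differ between versions.
sample_tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Map each (lowercased) word to the set of tags it received.
tags_by_word = defaultdict(set)
for word, tag in sample_tagged:
    tags_by_word[word.lower()].add(tag)

# Words that were tagged in more than one way are the ambiguous ones.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```

With the real output you would pass `tagged` instead of `sample_tagged`.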

#### Exercise 

Load a text of your choice, tokenize it, and perform part-of-speech tagging on it. Then extract the nouns from the text and perform a frequency analysis to identify the most common nouns. (Warning: POS tagging takes a good amount of time on long texts, so try to select a text with fewer than 10K tokens, or simply tag only the first 10K-20K tokens.)

Repeat the exercise for adjectives.

PS: If you want to parse text from HTML without resorting to XPath expressions, you can use the "BeautifulSoup" library as follows:

In [None]:
from bs4 import BeautifulSoup
import requests

url = "https://www.nytimes.com/2017/05/22/world/europe/ariana-grande-manchester-police.html"
#url="https://www.nytimes.com/2017/09/01/business/economy/jobs-report-unemployment.html"
resp = requests.get(url)
html = resp.text 
raw = BeautifulSoup(html, "lxml").get_text()

# The code below is to remove the junk that was extracted in addition to the article
start = raw.index("MANCHESTER, England —")
#start = raw.index("MACUNGIE")
end = raw.index("Rory Smith reported from Manchester, and Sewell Chan from London")
#end = raw.index("Clifford Krauss and Bill Vlasic contributed reporting.")
raw = raw[start:end]

# Let's do the NLTK stuff
tokens = nltk.word_tokenize(raw)
tagged = nltk.pos_tag(tokens)
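Note that `str.index` raises `ValueError` when a marker is missing, which can happen if the page layout changes. The cell below is a minimal sketch of a more forgiving alternative; `slice_between` is a hypothetical helper written for this notebook, not part of NLTK or BeautifulSoup, and the `page` string is a made-up stand-in for downloaded text.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Hypothetical helper: a missing marker falls back to the start or end
    of the text instead of raising ValueError as str.index would.
    """
    start = text.find(start_marker)
    if start == -1:
        start = 0
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)
    return text[start:end]

# Made-up page text, used only to demonstrate the helper.
page = "Menu | Login MANCHESTER: body of the article. Reporting credits. Footer"
article = slice_between(page, "MANCHESTER", "Reporting credits")
print(article)
```

Falling back silently keeps the notebook running, but it also means junk can slip through, so it is worth inspecting the result before tagging.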

In [None]:
tagged

In [None]:
# Get the nouns from the text
nouns = [token for (token, tag) in tagged if tag.startswith('NN') and token.isalpha()]
fd_nyt = nltk.FreqDist(nouns)
fd_nyt.most_common(20)

In [None]:
# Get the adjectives from the text
adjectives = [token for (token, tag) in tagged if tag.startswith('JJ') and token.isalpha()]
fd_nyt = nltk.FreqDist(adjectives)
fd_nyt.most_common(20)