# Tokenization, Tagging, Chunking - Part of Speech Tagging

In [1]:
import nltk

A part-of-speech (POS) tagger identifies the part of speech of each word in a sequence.

In [2]:
text = "I walked to the cafe to buy coffee before work."

In [3]:
# tokenize the text
tokens = nltk.word_tokenize(text)

In [4]:
# Tag each token with its part of speech (POS);
# the result is a list of (word, tag) pairs
nltk.pos_tag(tokens)

[('I', 'PRP'),
 ('walked', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('cafe', 'NN'),
 ('to', 'TO'),
 ('buy', 'VB'),
 ('coffee', 'NN'),
 ('before', 'IN'),
 ('work', 'NN'),
 ('.', '.')]
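
Because the result is a plain list of `(word, tag)` pairs, it can be regrouped with ordinary Python. A minimal sketch, assuming the tagged output shown above is copied into a variable (the names `tagged` and `words_by_tag` are introduced here for illustration):

```python
from collections import defaultdict

# Tagged output copied from the cell above (not re-computed here).
tagged = [('I', 'PRP'), ('walked', 'VBD'), ('to', 'TO'), ('the', 'DT'),
          ('cafe', 'NN'), ('to', 'TO'), ('buy', 'VB'), ('coffee', 'NN'),
          ('before', 'IN'), ('work', 'NN'), ('.', '.')]

# Collect the words that share each tag, preserving sentence order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(words_by_tag['NN'])  # ['cafe', 'coffee', 'work']
print(words_by_tag['TO'])  # ['to', 'to']
```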

The abbreviations `PRP`, `TO`, `NN`, etc. may be unfamiliar. They are part-of-speech tags from the Penn Treebank tagset, and NLTK can print what each one means.

In [5]:
# print each tag abbreviation with its meaning and example words
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or
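
To read a tagged sentence without scanning the whole tagset listing, one option is a small lookup table. A minimal sketch: the `TAG_GLOSSARY` dictionary is hand-written for this example (it paraphrases the Penn Treebank meanings and is not produced by NLTK). For a single tag, `nltk.help.upenn_tagset` also accepts a tag pattern, e.g. `nltk.help.upenn_tagset('NN')`.

```python
# Hand-written glossary for illustration only (an assumption, not NLTK output).
TAG_GLOSSARY = {
    'PRP': 'personal pronoun',
    'VBD': 'verb, past tense',
    'TO': '"to" as preposition or infinitive marker',
    'DT': 'determiner',
    'NN': 'noun, common, singular or mass',
    'VB': 'verb, base form',
    'IN': 'preposition or subordinating conjunction',
    '.': 'sentence terminator',
}

# First few pairs of the tagged sentence from the earlier cell.
tagged = [('I', 'PRP'), ('walked', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('cafe', 'NN')]

# Attach a readable meaning to every (word, tag) pair.
annotated = [(word, tag, TAG_GLOSSARY.get(tag, 'unknown tag')) for word, tag in tagged]
for word, tag, meaning in annotated:
    print(f"{word:8} {tag:4} {meaning}")
```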

**"Desert" can be two different parts of speech, and `nltk.pos_tag` correctly recognizes both**

In [6]:
nltk.pos_tag(nltk.word_tokenize("I will have desert."))

[('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('desert', 'NN'), ('.', '.')]

In [7]:
nltk.pos_tag(nltk.word_tokenize("They will desert us."))

[('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'), ('.', '.')]
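
The two outputs can be compared programmatically to confirm that the same word received different tags. A minimal sketch, with the tagged sentences copied from the cells above; the helper `tags_for` is a hypothetical name introduced for this example:

```python
# Tagged sentences copied from the two cells above.
sentences = [
    [('I', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('desert', 'NN'), ('.', '.')],
    [('They', 'PRP'), ('will', 'MD'), ('desert', 'VB'), ('us', 'PRP'), ('.', '.')],
]

def tags_for(word, tagged_sentences):
    """Return the set of tags assigned to `word` across the tagged sentences."""
    return {tag
            for sentence in tagged_sentences
            for token, tag in sentence
            if token.lower() == word}

desert_tags = tags_for('desert', sentences)
print(sorted(desert_tags))  # ['NN', 'VB']
```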

### **Create a list of `all` nouns** from the text

In [8]:
md = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
md[:9]

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY']

In [9]:
# Normalize the tokens: keep alphabetic words only and lowercase them
# (Gutenberg texts are already tokenized, so no tokenizing step is needed)
md_norm = [word.lower() for word in md if word.isalpha()]
# .isalpha() drops punctuation and numbers such as '[' and '1851'
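
To see what this normalization does, it can be applied to the nine tokens printed earlier. A minimal sketch that uses a hard-coded copy of that output, so it runs without the corpus:

```python
# First nine tokens of the corpus, copied from the output of md[:9].
sample = ['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY']

# Same comprehension as above: alphabetic tokens only, lowercased.
sample_norm = [word.lower() for word in sample if word.isalpha()]

print(sample_norm)
# ['moby', 'dick', 'by', 'herman', 'melville', 'etymology']
```

The brackets and the year are dropped, and every remaining word is lowercased.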

The tagger is very specific when it tags parts of speech: it distinguishes several kinds of noun. Here we are NOT interested in the individual noun types; we want all nouns, whatever their type.

We can do this by passing `tagset="universal"` to `nltk.pos_tag`. With the universal tagset, every kind of noun gets the single tag `NOUN`, so we can select all nouns at once.

In [10]:
md_tags = nltk.pos_tag(md_norm, tagset="universal")

In [13]:
md_tags[:5]

[('moby', 'NOUN'),
 ('dick', 'NOUN'),
 ('by', 'ADP'),
 ('herman', 'NOUN'),
 ('melville', 'NOUN')]

Here every noun is tagged simply as `NOUN`, rather than with a fine-grained tag such as `NN` as in the earlier example.
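
The idea behind a coarse tagset can be illustrated by collapsing the fine-grained Penn Treebank noun tags into one label. This is only a sketch: `NOUN_TAGS` and `collapse_noun_tag` are hypothetical names, the mapping covers nouns only, and the sample pairs are hand-assigned for illustration rather than tagger output. NLTK applies its own, fuller mapping when `tagset="universal"` is passed.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def collapse_noun_tag(tag):
    """Map any fine-grained noun tag to 'NOUN'; leave other tags unchanged."""
    return 'NOUN' if tag in NOUN_TAGS else tag

# Hand-assigned example pairs (an assumption, not produced by nltk.pos_tag).
fine_grained = [('whales', 'NNS'), ('Ahab', 'NNP'), ('ship', 'NN'), ('walked', 'VBD')]

coarse = [(word, collapse_noun_tag(tag)) for word, tag in fine_grained]
print(coarse)
# [('whales', 'NOUN'), ('Ahab', 'NOUN'), ('ship', 'NOUN'), ('walked', 'VBD')]
```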

In [14]:
md_tags[1]

('dick', 'NOUN')

In [15]:
# word
md_tags[1][0]

'dick'

In [16]:
# part of speech of the word
md_tags[1][1]

'NOUN'

In [17]:
md_nouns = [word[0] for word in md_tags if word[1] == "NOUN"]
md_nouns[:5]

['moby', 'dick', 'herman', 'melville', 'etymology']
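
The same filter can also be written with tuple unpacking instead of indexing, which some readers find easier to follow. A minimal sketch on a hard-coded copy of the first five tagged pairs shown earlier:

```python
# First five (word, tag) pairs, copied from the output of md_tags[:5].
sample_tags = [('moby', 'NOUN'), ('dick', 'NOUN'), ('by', 'ADP'),
               ('herman', 'NOUN'), ('melville', 'NOUN')]

# Unpack each pair into word and tag rather than using [0] and [1].
sample_nouns = [word for word, tag in sample_tags if tag == 'NOUN']

print(sample_nouns)  # ['moby', 'dick', 'herman', 'melville']
```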

**Now that we have a list of nouns, let's find the frequency distribution of the most common ones**

In [18]:
nouns_fd = nltk.FreqDist(md_nouns)

In [19]:
nouns_fd.most_common(10)

[('i', 1182),
 ('whale', 909),
 ('s', 774),
 ('man', 527),
 ('ship', 498),
 ('sea', 435),
 ('head', 337),
 ('time', 334),
 ('boat', 332),
 ('ahab', 278)]

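Most of these nouns (`whale`, `ship`, `sea`, `boat`, `ahab`) are descriptive enough to tell us what Moby Dick is about. The entries `i` and `s` are not really nouns: they are most likely the lowercased pronoun "I" and the leftover of a possessive `'s`, mis-tagged after normalization.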
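
One way to drop single-letter entries from the counts is a length filter. A minimal sketch, using the counts printed above rather than re-running the tagger; the one-character threshold is an assumption chosen for this example, and `collections.Counter` stands in for `nltk.FreqDist`, which is a subclass of it:

```python
from collections import Counter

# Top-10 counts copied from the output above.
top_nouns = Counter({'i': 1182, 'whale': 909, 's': 774, 'man': 527, 'ship': 498,
                     'sea': 435, 'head': 337, 'time': 334, 'boat': 332, 'ahab': 278})

# Keep only entries longer than one character.
filtered = Counter({word: count for word, count in top_nouns.items() if len(word) > 1})

print(filtered.most_common(3))
# [('whale', 909), ('man', 527), ('ship', 498)]
```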